Closed Bug 1244749 Opened 9 years ago Closed 3 months ago

Necko Hang in nsSocketTransportService::Poll after wakeup

Categories

(Core :: Networking, defect, P3)

Unspecified
Windows
defect
Points:
7

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: mcmanus, Unassigned)

References

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file, 4 obsolete files)

Boris sent us a log which shows SocketTransportService::Poll() not returning shortly after a laptop resumes from sleep. The actual Poll() iteration that happens during the sleep does complete after wakeup, but the next one does not. The logging isn't fine enough to know if this happens in the system poll() or in the gecko Poll().

We can work with the reporter to try and recreate and learn more

:bagder is going to lead.
Assignee: nobody → daniel
Here's a patch that adds lots of debugging logs in order to help tracking this down. It also forcible enables logging to a file in C:\temp or /tmp named firefox-helpful-log.txt. It uses this log set: "timestamp,append,poll-log:5,nsNotifyAddr:5,nsIOService:5" (where poll-log is the new POLL_LOG lines this patch adds in socket transport).

In addition to some extra checking of the pollable event that should allow us to get more info about what's actually going on, this code also has some added behavior controlled by the 'network.poll.debug' pref, which isn't created by this patch but it will read it if its set by the user.

The general idea being that the logging in itself should give us a very good idea of what is really happening, and after possibly some further analyzing we MIGHT be able to figure out/test more by altering the pref in a subsequent test - but it still relieves the testing person from having to download another firefox install.

network.poll.debug has these different additional (to what is described above) functions:

1 - makes nsIOService to setoffline on the SLEEP event and switch off offline again on the WAKEUP event

2 - makes socket transport explicitly kill the pollable event on SLEEP and clear the file descriptor to cause a busy loop until it is recreated. When set to 2, the socket transport service will act on network-change/network-up events and then create a new pollable event if previously blanked out.

3 - makes socket transport verify and possibly repair the pollable event on every received network change event
Flags: needinfo?(mcmanus)
Flags: needinfo?(dd.mozilla)
Comment on attachment 8714764 [details] [diff] [review]
0001-bug-1244749-add-logging-and-repair-attepts-on-networ.patch

Review of attachment 8714764 [details] [diff] [review]:
-----------------------------------------------------------------

{mThreadEvent = PR_NewPollableEvent();
mPollList[0].fd = mThreadEvent;
mPollList[0].out_flags = 0;}

should probably be made its own method that also holds the lock and afterwards calls setpollableevent() to cover the case where there was a event pending on the old pipe

I would like to see some logging in nspr - especially on error cases reading/writing..

::: netwerk/base/nsIOService.cpp
@@ +1412,5 @@
> +        debug) {
> +        if (!mSocketTransportService)  {
> +            LOG(("NotifyWakeup: socket transport is NULL"));
> +        } else {
> +            if (1 == debug)

nit braces

@@ +1486,5 @@
> +        nsCOMPtr<nsIPrefBranch> prefs;
> +        GetPrefBranch(getter_AddRefs(prefs));
> +        if (NS_SUCCEEDED(prefs->GetIntPref(NECKO_POLL_DEBUG, &debug)) &&
> +            debug) {
> +            if (1 == debug)

int braces

::: netwerk/base/nsSocketTransportService2.cpp
@@ +32,5 @@
>  
>  using namespace mozilla;
>  using namespace mozilla::net;
>  
> +LazyLogModule gSocketTransportLog("poll-log");

why is this changed instead of adding a new logmodule?

@@ +445,5 @@
>  }
>  
> +void
> +nsSocketTransportService::RepairPollable()
> +{

assert socket thread

@@ +456,5 @@
> +
> +        set = PR_WaitForPollableEvent(mThreadEvent);
> +        if (set != PR_SUCCESS) {
> +            POLL_LOG(("RepairPollable: pollable event is broken, attempt fix\n"));
> +            PR_DestroyPollableEvent(mThreadEvent);

log the return code here (assuming there is one)

@@ +1383,5 @@
> +                                data8.get()));
> +                      if (mThreadEvent) {
> +                          POLL_LOG((" thread event already up\n"));
> +                      } else {
> +                          DebugMutexAutoLock lock(mLock);

I think in mode 2 this replacement threadEvent should not get created until things have settled down.. doing it immediately when we know there are more events to come sounds fairly close to current behavior..

so I offer two paths.. perhaps its 2 modes
1] a timer
2] just do a destroy (if necessary) and create cycle on each LINK_DATA event.. so we would go through several cycles of this.

@@ +1384,5 @@
> +                      if (mThreadEvent) {
> +                          POLL_LOG((" thread event already up\n"));
> +                      } else {
> +                          DebugMutexAutoLock lock(mLock);
> +                          mThreadEvent = PR_NewPollableEvent();

this is main thread.. but mThreadEvent can't be assigned on the main thread - the socket thread could be listening to it in waitforpollableevent. I suppose wait4 could hold the lock as its not supposed to block
Flags: needinfo?(mcmanus)
Comment on attachment 8714764 [details] [diff] [review]
0001-bug-1244749-add-logging-and-repair-attepts-on-networ.patch

Review of attachment 8714764 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/nsSocketTransportService2.cpp
@@ +176,4 @@
>  
>      nsCOMPtr<nsIThread> thread = GetThreadSafely();
>      nsresult rv;
>      rv = thread ? thread->Dispatch(event_ref.forget(), flags) : NS_ERROR_NOT_INITIALIZED;

Add log here. also log rv value.

@@ +457,5 @@
> +        set = PR_WaitForPollableEvent(mThreadEvent);
> +        if (set != PR_SUCCESS) {
> +            POLL_LOG(("RepairPollable: pollable event is broken, attempt fix\n"));
> +            PR_DestroyPollableEvent(mThreadEvent);
> +            mThreadEvent = PR_NewPollableEvent();

also here, log mThreadEvent

@@ +529,5 @@
> +        } else {
> +            DebugMutexAutoLock lock(mLock);
> +            POLL_LOG(("Found no mThreadEvent, make one!\n"));
> +            mThreadEvent = PR_NewPollableEvent();
> +            mPollList[0].fd = mThreadEvent;

can we log mThreadEvent here
Flags: needinfo?(dd.mozilla)
(In reply to Patrick McManus [:mcmanus] from comment #2)

> I would like to see some logging in nspr - especially on error cases
> reading/writing..

Hm, is there a good trick to get the nspr logging to get into the general log? The nspr code is plain C and has its own set of logging functions etc, and I just think it would be unfortunate to have to log those in a separate file.

I have most of the other comments from you and Dragana addressed now, so I'll upload that patch in a sec.
if you add io:5 to your list. I think it is not too much logging.
(minor changes in functionality, one more debug mode added)

Here's a patch that adds lots of debugging logs in order to help tracking this down. It also forcible enables logging to a file in C:\temp or /tmp named firefox-helpful-log.txt. It uses this log set: "timestamp,append,poll-log:5,nsNotifyAddr:5,nsIOService:5" (where poll-log is the new POLL_LOG lines this patch adds in socket transport).

In addition to some extra checking of the pollable event that should allow us to get more info about what's actually going on, this code also has some added behavior controlled by the 'network.poll.debug' pref, which isn't created by this patch but it will read it if its set by the user.

The general idea being that the logging in itself should give us a very good idea of what is really happening, and after possibly some further analyzing we MIGHT be able to figure out/test more by altering the pref in a subsequent test - but it still relieves the testing person from having to download another firefox install.

network.poll.debug has these different additional (to what is described above) functions:

1 - makes nsIOService to setoffline on the SLEEP event and switch off offline again on the WAKEUP event

2 - makes socket transport explicitly kill the pollable event on SLEEP and clear the file descriptor to cause a busy loop until it is recreated. When set to 2, the socket transport service will act on network-change/network-up events and then create a new pollable event (in the next call to Poll()) if previously blanked out.

3 - makes socket transport verify and possibly repair the pollable event on every received network change event

4 - like [2] but instead of creating the event at the next Poll() after the UP/CHANGED event, it sets a 3 seconds timeout and asks for the pollable event to get created after that expires
Attachment #8714764 - Attachment is obsolete: true
(In reply to Daniel Stenberg [:bagder] from comment #4)
> The nspr code is plain C 

Just a note logging has moved here: xpcom/base/Logging.cpp
there's a try run for you.. I had to resolve a conflict in logging.cpp so you should double check it.
OS: Unspecified → Windows
Depends on: 662330
Whiteboard: [necko-active]
Comment on attachment 8715752 [details] [diff] [review]
v2-0001-bug-1244749-add-logging-and-repair-attepts-on-net.patch

Review of attachment 8715752 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/nsSocketTransportService2.cpp
@@ +476,5 @@
> +                      (void *)mThreadEvent, (int)rv));
> +            mPollList[0].fd = mThreadEvent;
> +            mPollList[0].out_flags = 0;
> +        }
> +    } while (++retry < 3);

I don't think we need to loop.

if this does fail we should go back to the !mThreadEvent state

@@ +539,5 @@
> +    POLL_LOG(("    ...returned %d after %i milliseconds\n",
> +               rv,
> +         PR_IntervalToMilliseconds(passedInterval)));
> +
> +    if ( PR_IntervalToMilliseconds(passedInterval) > 10000) {

I think this can stay.. at first I waffled. opinions?

@@ +657,5 @@
>          // else wait for Poll timeout
>      }
>  
> +    if (mCreateEventTimer) {
> +        mCreateEventTimer->Cancel();

mcreateeventtimer = nullptr

@@ +1371,5 @@
>          }
>  
>          return net::NetworkActivityMonitor::Init(blipInterval);
> +    } else if (!strcmp(topic, NS_TIMER_CALLBACK_TOPIC)) {
> +        // there is only one timeout that can trigger here

nonetheless make the if check that subject is the timer you expect

@@ +1373,5 @@
>          return net::NetworkActivityMonitor::Init(blipInterval);
> +    } else if (!strcmp(topic, NS_TIMER_CALLBACK_TOPIC)) {
> +        // there is only one timeout that can trigger here
> +        POLL_LOG(("Observe: timeout triggered\n"));
> +        mCreateThreadEvent = true; // please setup in next Poll() call

timer = nullptr

@@ +1402,5 @@
> +                          } else {
> +                              // ask for it to get created in 3 seconds
> +                              POLL_LOG((" network-change triggered create timer\n"));
> +                              if (!mCreateEventTimer) {
> +                                  mCreateEventTimer = do_CreateInstance("@mozilla.org/timer;1");

else cancel()

@@ +1420,5 @@
> +                  break;
> +                }
> +            }
> +        }
> +    } else if (!strcmp(topic, NS_WIDGET_SLEEP_OBSERVER_TOPIC)) {

cancel and null timer in this path

::: netwerk/base/nsSocketTransportService2.h
@@ +138,5 @@
>                              // whenever they access mThreadEvent.
>      bool        mAutodialEnabled;
>                              // pref to control autodial code
>  
> +    bool mCreateThreadEvent; // set 'true' to make Poll() create mThreadEvent

needs to be atomic to be used on multiple threads
we've lost our repro, but I think we ought to make landing some of this patch a high priority given the amount of analysis that has gone into it.

obviously we want to get rid of any non NSPR logging. And having multiple modes isn't really helpful - so the debug pref can go.

do you folks think there is value here too?
Flags: needinfo?(dd.mozilla)
Flags: needinfo?(daniel)
(In reply to Patrick McManus [:mcmanus] from comment #11)

> obviously we want to get rid of any non NSPR logging. And having multiple
> modes isn't really helpful - so the debug pref can go.
> 
> do you folks think there is value here too?

I think there's a genuine problem hiding in there somewhere that we want to fix, so yes having some extra logging for the next time someone has a decent way to reproduce sounds really good.

Do you think we should keep the extra logging in its own log module like in the current patch iteration or should we just bump it up a log level and keep it sockettransport ?
Flags: needinfo?(daniel) → needinfo?(mcmanus)
keep the logging in the normal places.. and this patch will do more than just log, it will verify and repair and we should keep that in as a belt-and-suspenders measure.
Flags: needinfo?(mcmanus)
Look at the comment and the stack:

https://crash-stats.mozilla.com/report/index/70131cbd-2e37-4ecc-9ac6-863502160209#allthreads

we should land this.
Flags: needinfo?(dd.mozilla)
agreed.. what do you think the eta for updates is?
Flags: needinfo?(daniel)
(In reply to Patrick McManus [:mcmanus] from comment #15)
> agreed.. what do you think the eta for updates is?

I'll post a refreshed version of this patch asap. Later today ideally.
Flags: needinfo?(daniel)
v3. Somewhat extended logging, repair pollable event logic. Some minor trailing whitespaces were removed which are annoying but hopefully not too disturbing (but I'd like to get rid of them).

The 10 second poll check is still there. The entire debug mode stuff is cut out. The logging is now only using the SOCKET_LOG() like the rest of the code.
Attachment #8715752 - Attachment is obsolete: true
Attachment #8718298 - Flags: review?(mcmanus)
Comment on attachment 8718298 [details] [diff] [review]
v3-0001-bug-1244749-add-logging-and-pollable-event-repair.patch

Review of attachment 8718298 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/nsSocketTransportService2.cpp
@@ +452,5 @@
> +                    (void *)mThreadEvent));
> +        mPollList[0].fd = mThreadEvent;
> +        mPollList[0].out_flags = 0;
> +    }
> +

if (!mThreadEvent) {
 return;
}

(allocations of this can fail) .

@@ +455,5 @@
> +    }
> +
> +    set = PR_SetPollableEvent(mThreadEvent);
> +    SOCKET_LOG(("RepairPollable: set pollable event: %d\n", (int)set));
> +

if (set == PR_SUCCESS) {
 set = PR_Wait4..
}

@@ +464,5 @@
> +      PRStatus rv = PR_DestroyPollableEvent(mThreadEvent);
> +      mThreadEvent = PR_NewPollableEvent();
> +      SOCKET_LOG(("RepairPollable: pollable recreated [%p] %d\n",
> +                  (void *)mThreadEvent, (int)rv));
> +      mPollList[0].fd = mThreadEvent;

same failure case considerations as before.. but in the success case
I think we want a PR_SetPollableEvent() call here.. in case we just destroyed an event that should have had something pending on it.

@@ +1193,5 @@
> +                    if (rv != PR_SUCCESS) {
> +                        SOCKET_LOG(("DoPollIteration: destroy pollable: %d\n",
> +                                    (int)rv));
> +                    }
> +                    mThreadEvent = nullptr;

doesn't mPollList[0] need to be set to null?

@@ +1198,2 @@
>                  }
> +                RepairPollable();

I think this needs to be called atomically under the same lock that sets threadevent to null.. otherwise lose notifications from another thread during the lock/unlock/lock transition.. so RepairPollable probly needs to be changed to require the caller to get the lock before calling (and assert that's true).

::: netwerk/base/nsSocketTransportService2.h
@@ +18,5 @@
>  #include "mozilla/Mutex.h"
>  #include "mozilla/net/DashboardTypes.h"
>  #include "mozilla/Atomics.h"
>  #include "mozilla/TimeStamp.h"
> +#include "nsITimer.h"

where did this functionality around wakeups and a settledown period go? This was good functionality I think we want..
Attachment #8718298 - Flags: review?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #18)

> where did this functionality around wakeups and a settledown period go? This
> was good functionality I think we want..

Oh, then I misunderstood you as that was code that only ran in one of the debug modes. I'll bring that part back.

My try-run for the v3 patch also proved something in there is terribly wrong: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8964e85e9ce6&selectedJob=16610109
I just meant that having configurable debug modes won't be that helpful if we're just landing it on the trunk.. but doing that quiescent behavior by default seems right.
v4. Brought back the sleep behavior that shuts down the pollable event on sleep and brings it back on network change events. Also edited according to previous comments.

I'm running this on try now: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c3677ab86e3

After my previous patch's major breakdown there I suspect this might not go through entirely clean either.

I assume this will collide somewhat what mcmanus's bug 698882 patch.
Attachment #8718298 - Attachment is obsolete: true
Attachment #8718757 - Flags: review?(mcmanus)
Comment on attachment 8718757 [details] [diff] [review]
v4-0001-bug-1244749-add-logging-and-pollable-event-repair.patch

Review of attachment 8718757 [details] [diff] [review]:
-----------------------------------------------------------------

let's get this merged with the patch for 698882.. I think that will land as soon as try says its ok.. other than the few minor things in this review, it seems ready to go

::: netwerk/base/nsSocketTransportService2.cpp
@@ +453,5 @@
> +    if (!mThreadEvent) {
> +        mThreadEvent = PR_NewPollableEvent();
> +        SOCKET_LOG(("RepairPollable: created event [%p]\n",
> +                    (void *)mThreadEvent));
> +        mPollList[0].fd = mThreadEvent;

I would set in flags here to be POLLIN and POLLEXCEPT. I know that's what it should be, but we're resetting the whole index afterall and are not in a fast path. Best for sanity.

@@ +471,5 @@
> +            mThreadEvent = PR_NewPollableEvent();
> +            SOCKET_LOG(("RepairPollable: pollable event broken [%p] %d\n",
> +                        (void *)mThreadEvent, (int)rv));
> +            mPollList[0].fd = mThreadEvent;
> +            mPollList[0].out_flags = 0;

in_flags again

@@ +476,5 @@
> +        }
> +        if (mThreadEvent) {
> +            // in case the event should have had an event, set one unconditionally
> +            // here
> +            rv = PR_SetPollableEvent(mThreadEvent);

just beware when merging with 698882 that the init state of the new event is already primed

@@ +545,5 @@
>  
> +    if ( PR_IntervalToMilliseconds(passedInterval) > 10000) {
> +        // waited longer than 10 seconds, verify mThreadEvent
> +        DebugMutexAutoLock lock(mLock);
> +        RepairPollable();

I think you only want to do this if mThreadEvent, right?

@@ +659,5 @@
>          // else wait for Poll timeout
>      }
>  
> +    if (mCreateEventTimer) {
> +        mCreateEventTimer->Cancel();

mCreateEventTimer = nullptr
Attachment #8718757 - Flags: review?(mcmanus) → feedback+
(In reply to Patrick McManus [:mcmanus] from comment #22)

> let's get this merged with the patch for 698882.. I think that will land as
> soon as try says its ok.. other than the few minor things in this review, it
> seems ready to go

Right. As soon as you get a version of bug 698882 landed, I'll rebase this one on top of that, get a fresh try-run to see that it looks good and I'll run a few local tests on a windows laptop.

I'll still refresh the patch and update according to your remarks and attach a v5 here as a reference.

> @@ +545,5 @@
> >  
> > +    if ( PR_IntervalToMilliseconds(passedInterval) > 10000) {
> > +        // waited longer than 10 seconds, verify mThreadEvent
> > +        DebugMutexAutoLock lock(mLock);
> > +        RepairPollable();
> 
> I think you only want to do this if mThreadEvent, right?

Good catch, otherwise it would probablyy bring it back sooner as it will probably end up in this condition first before a network-change event arrives.
This is just an interim patch, the updated v4. Bound to be slightly adjusted soon again.
Attachment #8718757 - Attachment is obsolete: true
Attachment #8720730 - Flags: feedback+
Blocks: 1158189
Are there any signs left that we still need this sort of work-around? Otherwise I propose we close this.
Flags: needinfo?(mcmanus)
Flags: needinfo?(dd.mozilla)
(In reply to Daniel Stenberg [:bagder] from comment #25)
> Are there any signs left that we still need this sort of work-around?
> Otherwise I propose we close this.

We could ask Valentin to check if we have a e-mail address of the user in crash in comment 14 and ask him to try the new version?

Or get/buy the the broken computer that we know that has this problem, fix it and try it :)
Flags: needinfo?(dd.mozilla)
there is no particular reason to think this is fixed so we should pursue it imo.
Flags: needinfo?(mcmanus)
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Priority: P1 → P2
Whiteboard: [necko-active] → [necko-next]
Moving to p3 because no activity for at least 1 year(s).
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P2 → P3
Assignee: daniel → nobody
Severity: normal → S3

there's a patch here; it can probably be made to apply without too much problem (that code is fairly stable). Impossible to see the old crash report dragana links to, but I presume this bug still is real

Whiteboard: [necko-next] → [necko-priority-new]
Whiteboard: [necko-priority-new] → [necko-priority-new][necko-triaged]

However, given our current priority, I assume this is not very urgent.
Hence, not moving to any of our priority queues.

Points: --- → 7
Rank: 8
Flags: needinfo?(rjesup)
Whiteboard: [necko-priority-new][necko-triaged] → [necko-triaged]

Code has changed a ton and honza landed fixes that include pollable repairs after things like network changes which should handle this.

Status: NEW → RESOLVED
Closed: 3 months ago
Flags: needinfo?(rjesup)
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: