Closed Bug 1152046 Opened 5 years ago Closed 4 years ago

[Win] close() and connect() on a socket may block for a very long time.

Categories

(Core :: Networking, defect)

All
Windows
defect
Not set

Tracking


RESOLVED FIXED
mozilla44
Tracking Status
firefox40 + wontfix
firefox41 + wontfix
firefox42 + wontfix
firefox43 --- wontfix
firefox44 --- affected

People

(Reporter: mayhemer, Assigned: dragana)

References

Details

(Whiteboard: [necko-active])

Attachments

(5 files, 11 obsolete files)

1.66 KB, patch
dragana
: review+
Details | Diff | Splinter Review
21.21 KB, patch
dragana
: review+
Details | Diff | Splinter Review
39.65 KB, patch
dragana
: review+
Details | Diff | Splinter Review
3.27 KB, patch
dragana
: review+
Details | Diff | Splinter Review
5.14 KB, patch
dragana
: review+
Details | Diff | Splinter Review
Honza - what do you think of the theory that this is being caused by a locally installed LSP? Perhaps the LSP is blocking on a non-blocking socket.

Here are three reports that look like that is what is happening:

https://crash-stats.mozilla.com/report/index/cd6dc70f-1afc-4596-94b6-8d26c2150404#allthreads
https://crash-stats.mozilla.com/report/index/77bf1520-1759-437b-a209-885922150401#allthreads
https://crash-stats.mozilla.com/report/index/9adc4168-b377-4a7c-801d-546d32150402#allthreads

In each case the Winsock LSP list is considerably more complicated than the Winsock LSP list you find in a crash due to something in Gecko (e.g. the dispatch deadlock). I took all the samples from XP SP3 just because I thought that might make the LSP lists more comparable - but it happens on multiple versions of Windows.
If you mean the stack:

KiFastSystemCallRet
NtDeviceIoControlFile
WSPBind
WSPAddressToString
WSPConnect
connect
_PR_MD_CONNECT
SocketConnect

then it looks to me like a pure native call (no LSP involved).  Being in bind() sounds like it is trying to find a local interface to bind to.  It may be the same cause as in the UDP close() case, waiting for an ARP or (more likely) a DHCP response - but those are guesses, not supported by any research right now.

It could also be OEM-specific code that misbehaves here...  something very hard to find out :(

There is, though, 'rsvpsp.dll' in the LSP list.  I don't have that file on my Win7 machine.
Blocks: 1158189
No longer blocks: 1124880
Assignee: nobody → dd.mozilla
rsvpsp.dll is also from Windows (used for some priorities, QoS stuff). (related to comment #2)
Attached file .bug_1152046_v1.patch.swp (obsolete) —
For most of these crashes the socket thread is doing PR_Connect.

This patch prevents connect during shutdown.

Let's try it and see if it helps.
Attachment #8610603 - Flags: review?(honzab.moz)
What's this 'swp' thing?
Attached patch bug_1152046_v1.patch (obsolete) — Splinter Review
sorry
Attachment #8610603 - Attachment is obsolete: true
Attachment #8610603 - Flags: review?(honzab.moz)
Attachment #8611372 - Flags: review?(honzab.moz)
Comment on attachment 8611372 [details] [diff] [review]
bug_1152046_v1.patch

Review of attachment 8611372 [details] [diff] [review]:
-----------------------------------------------------------------

This is good to have but I don't think it's going to help much with the root cause.  The delays/blocks in connect()/close() are IMO happening not just after shutdown.  The thing is that we try to shut down while the socket thread has already been hanging inside connect()/close() for a while.  Maybe that is even a reason people restart the browser - since pages are not loading, the socket thread is deadlocked.

There are two reasons this could happen IMO: A/V software interference or an LSP malware.

Some code that detects the time spent in connect()/close() would be good to have.  At least collect telemetry: time spent in the call, how often it spends more than some 2 seconds, does it happen after a network change, does it happen after shutdown only, whatever would be useful.

If long delays happen all the time (based on telemetry), we could do a test at the start or after any network change we detect and try to do something like manual local binding, or even let connect()/close() happen on new threads (similar to what I did for UDP close), and maybe warn the user that "your OS performs slowly when doing network access" or something...

::: netwerk/base/nsIOService.h
@@ +77,5 @@
>                                      uint32_t flags,
>                                      nsAsyncRedirectVerifyHelper *helper);
>  
>      bool IsOffline() { return mOffline; }
> +    bool IsShuttingDown() { return mShutdown; }

Maybe call this IsShutdown()?  To make it conform to the name of the member.

::: netwerk/base/nsSocketTransport2.cpp
@@ +1217,5 @@
>      bool isLocal;
>      IsLocal(&isLocal);
>  
> +    if (gIOService->IsShuttingDown()) {
> +      return NS_ERROR_ABORT;

nit: the file has 4-space indent
Attachment #8611372 - Flags: review?(honzab.moz) → review+
(In reply to Honza Bambas (:mayhemer) from comment #7)
> Comment on attachment 8611372 [details] [diff] [review]
> bug_1152046_v1.patch
> 
> Review of attachment 8611372 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This is good to have but I don't think it's gonna much help with the root
> cause.  The delays/blocks in connect()/close() are IMO happening not just
> after shutdown.  The thing is that we try to shutdown while the socket
> thread has already been hanging inside connect()/close() for a while.  Maybe
> that is even a reason people restart the browser - since pages are not
> loading, socket thread is deadlocked.
> 
> There are two reasons this could happen IMO: A/V software interference or an
> LSP malware.
> 
> Some code that detects the time spent in connect()/close() would be good to
> have.  At least collect telemetry, like time spent in, how often it happens
> to spent more than some 2 seconds, does it happen after network change, does
> it happen after shutdown only, whatever would be useful.
> 


I intended to add telemetry as well; it is coming soon...
Attachment #8611372 - Attachment is obsolete: true
Attachment #8612760 - Flags: review+
Attached patch bug_1152046_telemetry_v1.patch (obsolete) — Splinter Review
Attachment #8613916 - Flags: review?(honzab.moz)
Comment on attachment 8613916 [details] [diff] [review]
bug_1152046_telemetry_v1.patch

Review of attachment 8613916 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/nsSocketTransport2.cpp
@@ +53,5 @@
>  #endif
>  /* End keepalive config inclusions. */
>  
> +// If PR_Connect, PR_Close and PR_ContinueConnect are longer than
> +// STS_HANG_DURATION ms, consired it a hang and collect data about the state we

consired (typo)

@@ +1410,5 @@
>  
>      NetAddrToPRNetAddr(&mNetAddr, &prAddr);
>  
>      MOZ_EVENT_TRACER_EXEC(this, "net::tcp::connect");
> +    PRIntervalTime connectStarted;

please explain why you prefer PR intervals here (atomicity)

@@ +3100,5 @@
> +                                      STS_STATE_SHUTDOWN);
> +
> +            } else if ((gIOService->LastConnectivityChange() >= closeStarted)
> +                       &&
> +                       (gIOService->LastOfflineStateChange() >= closeStarted)){

I think we should include some window after the change.  Like "has been there a change in the net config in, say, last 60 seconds".  If so, add a 'state-changed' telemetry.  Otherwise, add 'normal-state' telemetry.

Then, I'm not sure change in the 'connectivity yes/no' state is what you want... Maybe check with :badger on this first.  There can be a wire connection, then for awhile both wired and wifi, then wire goes away (and that will not be seen as a change in connectivity..)

@@ +3116,5 @@
> +            } else {
> +                Telemetry::Accumulate(Telemetry::STS_LONG_PRCLOSE_STATE,
> +                                      STS_STATE_NORMAL);
> +            }
> +        }

please share the code, it's easy

::: toolkit/components/telemetry/Histograms.json
@@ +2462,5 @@
>      "high": "60000",
>      "n_buckets": 1000,
>      "description": "Time spend blocked on poll (ms)."
>    },
> +  "STS_PRCONNECT_TIME": {

The name should be "PRCONNECT_BLOCKING_TIME"

@@ +2500,5 @@
> +    "expires_in_version": "never",
> +    "kind": "enumerated",
> +    "n_values": 5,
> +    "description": "If it is in PR_Close more than 2000 ms, collect telemetry about the state: 4: shutdown, 3: offline and connectivity state changed 2: offline state changed, 1: connectivity change and 0: none."
> +  },

I'd rather have telemetry like:
PRCONNECT_NORMAL
PRCONNECT_AFTER_NETWORK_CHANGE
PRCONNECT_AFTER_SHUTDOWN

etc.. collecting time (exponential)
Attachment #8613916 - Flags: review?(honzab.moz) → feedback-
(In reply to Honza Bambas (:mayhemer) from comment #14)
> Comment on attachment 8613916 [details] [diff] [review]
> bug_1152046_telemetry_v1.patch
> 
> Review of attachment 8613916 [details] [diff] [review]:
> -----------------------------------------------------------------
> @@ +3100,5 @@
> > +                                      STS_STATE_SHUTDOWN);
> > +
> > +            } else if ((gIOService->LastConnectivityChange() >= closeStarted)
> > +                       &&
> > +                       (gIOService->LastOfflineStateChange() >= closeStarted)){
> 
> I think we should include some window after the change.  Like "has been
> there a change in the net config in, say, last 60 seconds".  If so, add a
> 'state-changed' telemetry.  Otherwise, add 'normal-state' telemetry.
> 
> Then, I'm not sure change in the 'connectivity yes/no' state is what you
> want... Maybe check with :badger on this first.  There can be a wire
> connection, then for awhile both wired and wifi, then wire goes away (and
> that will not be seen as a change in connectivity..)
> 

Looking at the code I am sure that connectivity means we have some connection or we do not have any. I will check this with :badger.
I want to separate the connectivity and offline states, which are 2 separate things. Looking at the code, a connectivity state change will change the offline state if the offline state has not been forced for some other reason.

So I want to have:
PRCONNECT_BLOCKING_NORMAL - no shutdown, no network change, no offline state change
PRCONNECT_BLOCKING_SHUTDOWN - we are in shutdown
PRCONNECT_BLOCKING_NETWORK_CHANGE - connectivity changed
PRCONNECT_BLOCKING_OFFLINE - set offline (this will probably be the same as normal, except when we have some error in necko code)
Attached patch bug_1152046_telemetry_v2.patch (obsolete) — Splinter Review
Can you take a look at the connectivity bit?

I wanted to save the time of the last connectivity change - connected/disconnected.
Attachment #8613916 - Attachment is obsolete: true
Attachment #8614624 - Flags: feedback?(daniel)
Tracking for 40. Are other versions affected?
Daniel, can you help here?
Flags: needinfo?(daniel)
I see no problems in this!
Flags: needinfo?(daniel)
Attachment #8614624 - Flags: feedback?(daniel) → feedback+
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #12)
> https://hg.mozilla.org/mozilla-central/rev/1badbdeca325

Landed on Aurora a=sledru to fix uplift bustage from bug 1169554.
https://hg.mozilla.org/releases/mozilla-aurora/rev/6b9bf28f9b8a
Attached patch bug_1152046_telemetry_v2.patch (obsolete) — Splinter Review
Attachment #8614624 - Attachment is obsolete: true
Attachment #8628777 - Flags: review?(honzab.moz)
Comment on attachment 8628777 [details] [diff] [review]
bug_1152046_telemetry_v2.patch

Review of attachment 8628777 [details] [diff] [review]:
-----------------------------------------------------------------

Please fix a few of the nits and if there are no major changes you want to do, just land it!  Good work.

::: netwerk/base/nsIOService.cpp
@@ +1521,5 @@
>      }
>  
>      bool isUp = true;
>      if (!strcmp(data, NS_NETWORK_LINK_DATA_CHANGED)) {
> +        mLastNetworkLinkChange = PR_IntervalNow();

please double check this.  I'm not sure this fires every time..

::: netwerk/base/nsIOService.h
@@ +173,5 @@
>      static bool                          sTelemetryEnabled;
> +
> +    // These are needed for the telemetry about PR_Connect anf PR_Close
> +    // duration. If PRConnect or PR_Close last to long check whether
> +    // connectivity or offline state has change in between.

Rather:

"These timestamps are needed for collecting telemetry on PR_Connect, PR_ConnectContinue and PR_Close blocking time.  On too long time spent in any of these functions we want to know what network change happened shortly before."

BTW, it might be smarter to collect only times that take more than, say, 1 second?  I am really not sure on this, but I don't want to see a graph of 0 - 0.1 ms -> 99% and then the rest (that we are interested in!) hidden in the 1 percent of all the captures.

::: netwerk/base/nsSocketTransport2.cpp
@@ +1406,3 @@
>      status = PR_Connect(fd, &prAddr, NS_SOCKET_CONNECT_TIMEOUT);
> +
> +    if (gSocketTransportService->IsTelemetryEnabled() && connectStarted) {

if you want the "&& connectStarted" part then you must initialize connectStarted to 0.  maybe have the condition just as "if (connectStarted) {" ?

@@ +1881,3 @@
>          PRStatus status = PR_ConnectContinue(fd, outFlags);
> +
> +        if (gSocketTransportService->IsTelemetryEnabled() && connectStarted) {

same here

::: netwerk/base/nsSocketTransportService2.h
@@ +109,5 @@
>  
>      // Returns true if keepalives are enabled in prefs.
>      bool IsKeepaliveEnabled() { return mKeepaliveEnabledPref; }
> +
> +    bool IsTelemetryEnabled() { return mTelemetryEnabledPref; }

some small comment would be nice
Attachment #8628777 - Flags: review?(honzab.moz) → review+
Attachment #8628777 - Attachment is obsolete: true
Attachment #8630588 - Flags: review+
Check-in needed only for patch bug_1152046_telemetry_v2.patch
Keywords: checkin-needed
Status: NEW → ASSIGNED
Honza, just wanted to ping you on whether we have a fix in the works on this one or a patch ready for uplift to beta. bug 1158189 is a top crash on beta channel and blocked by this bug. Thanks!
Flags: needinfo?(honzab.moz)
I will answer and Honza can correct me.

We are currently collecting telemetry to see when it is happening. We assume it is after a network connection change or maybe just when there is no connectivity.

For Honza:
From the telemetry we have collected so far I see this happening in all cases sometimes. Some values are pretty high on Windows, and on OS X too, but not as high as on Windows. I could do a fix to split connect off to a separate thread. (I need to look closely into the data to see if we can skip it in some cases.)
(In reply to Dragana Damjanovic [:dragana] from comment #29)
> I will answer and Honza can correct me.
> 
> We currently collecting telemetry to see when it is happening. We assume it
> is after network connection change or maybe just when there is no
> connectivity. 
> 
> For Honza:
> Form telemetry that we have collected so far I see this happening for all
> cases some times. some are pretty high on windows and on osx but not as high
> as in windows. I could do a fix to split connect to a separate thread. (I
> need to look closely into the data to see if we can skip in some cases)

As Dragana says, the only fix here is to either do more investigation, like contacting the people this happens to (not a usual way we work at Mozilla, though), or have separate thread(s) handling these function calls.

I think having a thread pool (limited to, say, 5 or so threads) is the way to go here as a general workaround.

I personally believe there is some antivirus or virus (how ironic :)) involved here, but that is more of a guess.
Flags: needinfo?(honzab.moz)
A peek into telemetry between 8 and 20 July 2015.

The numbers next to the platform names show the maximal value seen.

Linux values are very low in comparison to OS X and Windows. And Windows has much higher numbers than OS X, which is why most of the crashes are on Windows.

PRCONNECT_BLOCKING_TIME_CONNECTIVITY_CHANGE:
  linux 26ms
  osx   23s
  windows even 2 samples above 60s

PRCONNECT_BLOCKING_TIME_LINK_CHANGE:
  linux -
  osx 700ms
  windows 9s

PRCONNECT_BLOCKING_TIME_NORMAL
  linux 116ms
  osx   9s 
  windows 19 samples above 60s

PRCONNECT_BLOCKING_TIME_OFFLINE
  (small sample size)
  linux (very small sample size: 40 samples, 1-2ms)
  osx (very small sample size: 200 samples, 1-2ms)
  windows 43s


PRCONNECTCONTINUE_BLOCKING_TIME_CONNECTIVITY_CHANGE
  linux 23ms
  osx 97ms
  windows 11s

PRCONNECTCONTINUE_BLOCKING_TIME_LINK_CHANGE
  linux -
  osx 46ms
  windows 10s

PRCONNECTCONTINUE_BLOCKING_TIME_NORMAL
  linux 73ms
  osx 133ms
  windows 2 samples above 60s

PRCONNECTCONTINUE_BLOCKING_TIME_OFFLINE:
 (small sample size)
 windows 16ms

PRCLOSE_BLOCKING_TIME_CONNECTIVITY_CHANGE
  linux 53ms
  osx 16s
  windows 1 sample above 60s

PRCLOSE_BLOCKING_TIME_LINK_CHANGE
  linux -
  osx 1s
  windows 48s

PRCLOSE_BLOCKING_TIME_NORMAL
  linux 11s
  osx 27s
  windows 9 samples above 60s

PRCLOSE_BLOCKING_TIME_OFFLINE:
  (small sample size)
  linux 6s
  osx 86ms
  windows 16s
damn. those are crazy numbers. Dragana, do these numbers look different at the 98th or 99th percentile rather than at the max? If not, then this is pretty nuts. (If so, even simple things like some kind of massive memory thrashing can explain .0001 outliers, which would be irrelevant to us)

I think it is interesting that the close_blocking_time_normal is so bad cross platform.

Honza, do you think there is any reason to have this done synchronously (to the rest of gecko)? 

can we just implement pr_close at the bottom of the stack to return true without closing anything, but pass the OS fd to a thread that queues them up and closes them as fast as possible? I suggest this rather than a thread pool because the thread pool still creates a slow sync point for the outlier cases (though it improves HoL blocking).

at least we have a good metric we could track now. Thanks Dragana.
Flags: needinfo?(honzab.moz)
Flags: needinfo?(dd.mozilla)
(In reply to Patrick McManus [:mcmanus] from comment #32)
> damn. those are crazy numbers. Dragana do these numbers look different at
> the 98th or 99th percentile rather than max? If not, then this is pretty
> nuts. (If so, even simple things like some kind of massive memory thrashing
> can explain .0001 outliers which would be irrelevant to us) 

Yep, it would be good to check what percentile of blocks takes more than, say, 4 seconds.  Or any other reasonable number of seconds.

> 
> I think it is interesting that the close_blocking_time_normal is so bad
> cross platform.
> 
> Honza, do you think there is any reason to have this done synchronously (to
> the rest of gecko)? 
> 
> can we just implement pr_close at the bottom of the stack to return true
> without closing anything, but pass the OS fd to a thread that queues them up
> and closes them as fast as possible. I suggest this rather than a thread
> pool as a possibility 


"because the thread pool still creates a slow sync point for the outlier cases" 

I don't follow this argument.  Are you suggesting we have one thread per close (connect)?  I did the same for UDP socket close - no pool, spawn a new thread for any invocation of nsUDPSocket::Close() ending up at PR_Close().  But we use UDP more rarely than TCP, so having a ton of threads for each connect we call seems to me like a waste of resources.  But I can be persuaded otherwise.  Both solutions are good.  The only problem is that the impact of a raised number of threads is hard to measure; it's kind of a slow death...

> (though improves HoL blocking).
> 
> at least we have a good metric we could track now. Thanks Dragana.

Yep, thanks!
Flags: needinfo?(honzab.moz)
roughly:


1] put a nspr layer where layer->lower == tcp/udp

2] implement pr_close(aFD) {
int fd = GetOsFD(aFD); // I forget what this is really called
aFD->lower->dtor(aFD->lower); // or maybe set fd to -1 and call lower->close()
postmessage(closingThread, fd);
return 0;
}

on closingThread
while(!shutdown) {
recvmessage(&fd);
close(fd);
}
(In reply to Patrick McManus [:mcmanus] from comment #32)
> damn. those are crazy numbers. Dragana do these numbers look different at
> the 98th or 99th percentile rather than max? If not, then this is pretty
> nuts. (If so, even simple things like some kind of massive memory thrashing
> can explain .0001 outliers which would be irrelevant to us) 
> 



I should have added these stats too; I was just surprised by the max values, they are crazy.
The 95th and even the 99th percentile are only 1ms.
There are 3365 samples larger than 1 second and 1453 larger than 4s.

Dispatching this to a pool of threads, or just separate threads, is going to make this worse for everybody else.  I am not sure we should do this.

> I think it is interesting that the close_blocking_time_normal is so bad
> cross platform.
> 
> Honza, do you think there is any reason to have this done synchronously (to
> the rest of gecko)? 
> 
> can we just implement pr_close at the bottom of the stack to return true
> without closing anything, but pass the OS fd to a thread that queues them up
> and closes them as fast as possible. I suggest this rather than a thread
> pool as a possibility because the thread pool still creates a slow sync
> point for the outlier cases (though improves HoL blocking).
> 
> at least we have a good metric we could track now. Thanks Dragana.
Flags: needinfo?(dd.mozilla)
My message got in conflict with yours.
The suggestion from comment 34 could work.
(In reply to Patrick McManus [:mcmanus] from comment #34)
> roughly:
> 
> 
> 1] put a nspr layer where layer->lower == tcp/udp
> 
> 2] implement pr_close(aFD) {
> int fd = GetOsFD(aFD); // I forget what this is really called
> aFD->lower->dtor(aFD->lower); // or maybe set fd to -1 and call
> lower->close()
> postmessage(closingThread, fd);
> return 0;
> }
> 
> on closingThread
> while(!shutdown) {
> recvmessage(&fd);
> close(fd);
> }

If we just leak any queued (not yet closed) sockets after shutdown, then this is a good solution.  It should also replace my patch for UDP close.

How about connect/cont-connect, though?
Attached patch bug_1152046_prclose_v1.patch (obsolete) — Splinter Review
Here is one version. Please take a look...
Attachment #8641199 - Flags: feedback?(mcmanus)
Comment on attachment 8641199 [details] [diff] [review]
bug_1152046_prclose_v1.patch

Review of attachment 8641199 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!  Just a few internals to fix.

::: netwerk/base/nsClosingService.cpp
@@ +16,5 @@
> +TcpUdpPRCloseLayerClose(PRFileDesc *aFd)
> +{
> +  PRFileDesc *fd = aFd->lower;
> +  nsClosingService::PostCloseRequest(fd);
> +  aFd = 0;

you probably don't need to do this

@@ +36,5 @@
> +{
> +  PR_DestroyLock(mLock);
> +  PR_DestroyCondVar(mQueueCondVar);
> +  gInstance = nullptr;
> +  mQueue.Clear();

hmm.. the queue must be accessed only on a single thread (which I think it cannot be) or always be accessed unconditionally only under the lock

@@ +40,5 @@
> +  mQueue.Clear();
> +}
> +
> +void
> +nsClosingService::Start()

put // static before the method definition (applies to all static method here)

@@ +105,5 @@
> +void
> +nsClosingService::PostCloseRequest(PRFileDesc *aFd)
> +{
> +  // Check if shutdown is called.
> +  if (!sShutdown) {

nicer would be:

if (sShutdown) {
  // Let the socket leak, we want the system take care... bla bla bla
  return;
}

if (..)..

@@ +110,5 @@
> +    // Check if gInstance is present, if not call in the standard way.
> +    // Theoretically this can happen before nsIOService initializes
> +    // nsClosingService.
> +    if (gInstance) {
> +      gInstance->PostRequest(aFd);

so, I'm missing one major thing here: you are not correctly handling the service lifetime

each layer instance you create must have its 'secret' set to a struct that has an nsRefPtr<nsClosingService> that refers to the service.

here you will pick the refptr from your secret struct and use it instead of gInstance and then just nullify it.  this will make sure the queue and the locks exist.  the code you have may nicely race.

other option is to StaticMutex or StaticMonitor over gInstance

@@ +115,5 @@
> +    } else {
> +      PR_Close(aFd);
> +    }
> +  }
> +  aFd = nullptr;

why?

@@ +179,5 @@
> +  PR_Unlock(mLock);
> +}
> +
> +void
> +nsClosingService::ResolveRequest()

maybe ProcessRequest would be better (if you really want to have a method for this, inlining into ThreadFunc would work well too)

@@ +186,5 @@
> +  if (mQueue.Length() == 0) {
> +    PR_WaitCondVar(mQueueCondVar, PR_INTERVAL_NO_TIMEOUT);
> +  }
> +  if (sShutdown) {
> +    PR_Unlock(mLock);

grrr....

::: netwerk/base/nsClosingService.h
@@ +62,5 @@
> +                            mozilla::Telemetry::ID aIDLinkChange,
> +                            mozilla::Telemetry::ID aIDOffline);
> +
> +  static bool sShutdown;
> +  static nsClosingService *gInstance;

sInstance (or preferably sSelf) since it's an object's static not a global static

@@ +65,5 @@
> +  static bool sShutdown;
> +  static nsClosingService *gInstance;
> +  nsTArray<PRFileDesc *> mQueue;
> +  PRLock *mLock;
> +  PRCondVar *mQueueCondVar;

use mozilla::Monitor please.  if there is some reason not to please add a comment why

::: netwerk/base/nsIncrementalDownload.cpp
@@ +17,5 @@
>  #include "nsIFile.h"
>  #include "nsITimer.h"
> +#include "nsIStreamListener.h"
> +#include "nsIURI.h"
> +#include "nsIInputStream.h"

hmm?

::: netwerk/base/nsSimpleNestedURI.cpp
@@ +7,5 @@
>  
>  #include "nsSimpleNestedURI.h"
>  #include "nsIObjectInputStream.h"
>  #include "nsIObjectOutputStream.h"
> +#include "nsNetUtil.h"

hmm2?

::: netwerk/base/nsURLHelper.cpp
@@ +12,5 @@
>  #include "nsCOMPtr.h"
>  #include "nsCRT.h"
>  #include "nsNetCID.h"
>  #include "prnetdb.h"
> +#include "mozilla/Preferences.h"

hmm3?
Attachment #8641199 - Flags: feedback+
Attachment #8641199 - Flags: feedback?(mcmanus)
We're at the end of the 40 cycle and this bug is now wontfix for 40.

As this bug is visible in 40 Betas and 41 Aurora, I'm tracking for 41+.
Attached patch bug_1152046_prclose_v1.patch (obsolete) — Splinter Review
Attachment #8641199 - Attachment is obsolete: true
https://treeherder.mozilla.org/#/jobs?repo=try&revision=873fff73ca52


Honza, can you take a look please?

There is a socket leak that is expected to happen. A possibility is to allow PR_Close for, say, 200-500 ms and then let the remaining sockets leak?
Flags: needinfo?(honzab.moz)
I'm not sure what you exactly want from me, can you restate please?
Flags: needinfo?(honzab.moz) → needinfo?(dd.mozilla)
Attached patch bug_1152046_prclose_v1.patch (obsolete) — Splinter Review
May I ask you for a review? Thanks.
Attachment #8650560 - Attachment is obsolete: true
Flags: needinfo?(dd.mozilla) → needinfo?(honzab.moz)
Jason, would you be able to review this? I noticed Honza's Bugzilla email says "not reviewing". This bug is tracked for FF41, so if there is a fix that is safe, we would want to uplift to Beta 41 soon.
Flags: needinfo?(jduell.mcbugs)
Honza has been reviewing previous patches here so I'm going to let him keep doing that.  Bounce it back to me if he hasn't gotten to it in a few days.
Flags: needinfo?(jduell.mcbugs)
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #44)
> I'm not sure what you exactly want from me, can you restate please?

I was actually asking about:

"There is a leak on socket that is expected to happend. A possibilty is to allow PR_Close for let's say 200-500 ms and then let the rest sockets leak?"

what does that mean?
Flags: needinfo?(dd.mozilla)
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #48)
> (In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #44)
> > I'm not sure what you exactly want from me, can you restate please?
> 
> I was actually asking about:
> 
> "There is a leak on socket that is expected to happend. A possibilty is to
> allow PR_Close for let's say 200-500 ms and then let the rest sockets leak?"
> 
> what does that mean?

Sorry, I have fixed that - I added PR_Free(fd), which I had forgotten before, and I have done some other fixes, so there is no leak anymore.
Flags: needinfo?(dd.mozilla)
Comment on attachment 8654487 [details] [diff] [review]
bug_1152046_prclose_v1.patch

Review of attachment 8654487 [details] [diff] [review]:
-----------------------------------------------------------------

nsClosingService -> mozilla::net::ClosingService

::: netwerk/base/nsClosingService.cpp
@@ +34,5 @@
> +  if (!aFd) {
> +    return PR_FAILURE;
> +  }
> +
> +  ClosingLayerSecret *closingService =

s/closingService/closingLayerSecret/

@@ +35,5 @@
> +    return PR_FAILURE;
> +  }
> +
> +  ClosingLayerSecret *closingService =
> +    reinterpret_cast<ClosingLayerSecret *>(aFd->secret);

you must do this on your layer!  how can you be sure aFd is what you expect?

@@ +39,5 @@
> +    reinterpret_cast<ClosingLayerSecret *>(aFd->secret);
> +  aFd->secret = nullptr;
> +
> +  PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> +  NS_ASSERTION(layer &&

MOZ_ASSERT, maybe actually MOZ_RELEASE_ASSERT

@@ +42,5 @@
> +  PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> +  NS_ASSERTION(layer &&
> +               layer->identity == sTcpUdpPRCloseLayerId,
> +               "Closing Layer not on top of stack");
> +  layer->dtor(layer);

are you sure about this?  doesn't PR_Close on the layer do that?  I think this just delegates to the layer below.

@@ +46,5 @@
> +  layer->dtor(layer);
> +
> +  PRStatus status = PR_SUCCESS;
> +
> +  if (aFd) {

will always be true

@@ +60,5 @@
> +      }
> +    } else {
> +      // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> +      // we are not going to leak anything, but the PR_Close will not be called.
> +      PR_Free(aFd);

shouldn't you use dtor() here?  I'm no expert (it's been too long since the last time I played with this stuff) so rather double check.  but that would make more sense to me.  same in the shutdown case in ThreadFunc

@@ +69,5 @@
> +  return status;
> +}
> +
> +mozilla::Atomic<bool> nsClosingService::sShutdown(false);
> +nsRefPtr<nsClosingService> nsClosingService::sInstance = nullptr;

these both are static initializers, something we don't want

you can have a raw sInstance ptr and NS_ADDREF it in the ctor and NS_RELEASE it in the dtor.  be careful with multiple threads though..

@@ +158,5 @@
> +    PR_Free(aFd);
> +    return;
> +  }
> +
> +  bool notify = (mQueue.Length() == 0)? true : false;

)? 

..missing space

@@ +161,5 @@
> +
> +  bool notify = (mQueue.Length() == 0)? true : false;
> +
> +  mQueue.AppendElement(aFd);
> +  if (notify) {

if (mQueue.Length() == 1) ?

@@ +174,5 @@
> +  MOZ_ASSERT(NS_IsMainThread());
> +
> +  if (sInstance) {
> +    sInstance->ShutdownInternal();
> +    sInstance = nullptr;

so, another thread may race and take a ref to sInstance after it has been shut down?  Is that OK?  Maybe it is, just checking..

@@ +200,5 @@
> +
> +void
> +nsClosingService::ThreadFunc()
> +{
> +  while (1) {

for (;;) or while (true) are preferred

@@ +217,5 @@
> +          // PR_NSPR_IO_LAYER we are not going to leak anything, but PR_Close
> +          // will not be called.
> +          PR_Free(fd);
> +        }
> +        mQueue.Clear(); 

ws

::: netwerk/base/nsClosingService.h
@@ +28,5 @@
> +
> +namespace mozilla {
> +namespace net {
> +
> +class nsClosingService final : public nsISupports

to make nsRefPtr work with this you just need to define AddRef and Release.  There is a macro in nsISupportsImpl.h, NS_INLINE_DECL_THREADSAFE_REFCOUNTING, for inline declaration of those two for you.  Don't derive from nsISupports when not needed (which it is not in this case)

@@ +58,5 @@
> +                            mozilla::Telemetry::ID aIDConnectivityChange,
> +                            mozilla::Telemetry::ID aIDLinkChange,
> +                            mozilla::Telemetry::ID aIDOffline);
> +
> +  static nsRefPtr<nsClosingService> sInstance;

hm... this is a static initializer.  not good..

::: netwerk/base/nsIOService.cpp
@@ +43,5 @@
>  #include "nsIWidget.h"
>  #include "nsThreadUtils.h"
>  #include "mozilla/LoadInfo.h"
>  #include "mozilla/net/NeckoCommon.h"
> +#include "mozilla/Services.h"

why are all the includes here?

::: netwerk/base/nsSocketTransportService2.cpp
@@ +557,5 @@
>  
> +#if !defined(MOZILLA_XPCOMRT_API)
> +    // Start closing service. The PRClose functions will be carried out on
> +    // the closeService thread. Started here so that it is started just
> +    // once.

Start the closing service.  Actual PR_Close() will be carried out on a separate "closing" thread.  Start the closing service here since this point is executed only once per session.

::: netwerk/base/nsURLHelper.cpp
@@ +11,5 @@
>  #include "nsIURLParser.h"
>  #include "nsCOMPtr.h"
>  #include "nsCRT.h"
>  #include "nsNetCID.h"
> +#include "mozilla/Preferences.h"

why this include?
Attachment #8654487 - Flags: review-
Flags: needinfo?(honzab.moz)
I am not sure if the fix will come in time for 41. We have lived one release (40) with this issue and we may have to live with it for another release 41. Updating status-firefox41 to wontfix.

If the fix is ready in the next 3-4 days, and safe to uplift to Beta, please nominate.
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #51)
> Comment on attachment 8654487 [details] [diff] [review]
> bug_1152046_prclose_v1.patch
> 
> Review of attachment 8654487 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> @@ +35,5 @@
> > +    return PR_FAILURE;
> > +  }
> > +
> > +  ClosingLayerSecret *closingService =
> > +    reinterpret_cast<ClosingLayerSecret *>(aFd->secret);
> 
> you must do this on your layer!  how can you be sure aFd is what you expect?


If we are not on the right layer, something is really wrong, but you are right: it should be done in the correct order.


> 
> @@ +42,5 @@
> > +  PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> > +  NS_ASSERTION(layer &&
> > +               layer->identity == sTcpUdpPRCloseLayerId,
> > +               "Closing Layer not on top of stack");
> > +  layer->dtor(layer);
> 
> are you sure about this?  doesn't PR_Close on the layer do that?  I think
> this just delegates to the layer below.
> 


We do not want to call PR_Close. PR_Close calls the close function of this layer, which is the function we are in :)
layer->dtor and PR_Free(layer) will end up doing the same thing, because the layer has been popped and the stack connections are cut (lower and higher are set to null).


> @@ +46,5 @@
> > +  layer->dtor(layer);
> > +
> > +  PRStatus status = PR_SUCCESS;
> > +
> > +  if (aFd) {
> 
> will always be true
> 

aFd does not have the same value as at the beginning of the function. Call to PR_PopIOLayer has set aFd to the next layer in the chain.
aFd should be non-null, but I have added this check just to be sure.


> @@ +60,5 @@
> > +      }
> > +    } else {
> > +      // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> > +      // we are not going to leak anything, but the PR_Close will not be called.
> > +      PR_Free(aFd);
> 
> shouldn't you use dtor() here?  I'm no expert (it's been too long since the
> last time I played with this stuff) so rather double check.  but that would
> make more sense to me.  same in the shutdown case in ThreadFunc
> 

The only difference between PR_Free and dtor is that dtor cuts the connections in the layer chain (sets lower and higher to null), but in this case it does not matter which one we use.

> 
> @@ +174,5 @@
> > +  MOZ_ASSERT(NS_IsMainThread());
> > +
> > +  if (sInstance) {
> > +    sInstance->ShutdownInternal();
> > +    sInstance = nullptr;
> 
> so, other thread may race and take ref to sInstance after is has been shut
> down?  Is that OK?  Maybe it is, just checking..
> 

It is OK because mShutdown is checked before putting a request in the queue.



> ::: netwerk/base/nsIOService.cpp
> @@ +43,5 @@
> >  #include "nsIWidget.h"
> >  #include "nsThreadUtils.h"
> >  #include "mozilla/LoadInfo.h"
> >  #include "mozilla/net/NeckoCommon.h"
> > +#include "mozilla/Services.h"
> 
> why are all the includes here?
> 
> ::: netwerk/base/nsURLHelper.cpp
> @@ +11,5 @@
> >  #include "nsIURLParser.h"
> >  #include "nsCOMPtr.h"
> >  #include "nsCRT.h"
> >  #include "nsNetCID.h"
> > +#include "mozilla/Preferences.h"
> 
> why this include?

This fixes compile errors (caused by adding a new file to moz.build, which probably caused a file to move from one unified compilation file to another).
Attached patch bug_1152046_prclose_v2.patch (obsolete) — Splinter Review
Attachment #8654487 - Attachment is obsolete: true
Flags: needinfo?(honzab.moz)
(In reply to Dragana Damjanovic [:dragana] from comment #54)
> (In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #51)
> > Comment on attachment 8654487 [details] [diff] [review]
> > bug_1152046_prclose_v1.patch
> > 
> > Review of attachment 8654487 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > @@ +35,5 @@
> > > +    return PR_FAILURE;
> > > +  }
> > > +
> > > +  ClosingLayerSecret *closingService =
> > > +    reinterpret_cast<ClosingLayerSecret *>(aFd->secret);
> > 
> > you must do this on your layer!  how can you be sure aFd is what you expect?
> 
> 
> If we are not on the right layer something is really wrong, but you I right
> it should be one in the correct order.

If the code ensures you are on the right level then OK.  But some kind of assertion at least would be at place.

> 
> 
> > 
> > @@ +42,5 @@
> > > +  PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> > > +  NS_ASSERTION(layer &&
> > > +               layer->identity == sTcpUdpPRCloseLayerId,
> > > +               "Closing Layer not on top of stack");
> > > +  layer->dtor(layer);
> > 
> > are you sure about this?  doesn't PR_Close on the layer do that?  I think
> > this just delegates to the layer below.
> > 
> 
> 
> We do not want to call PR_Close. PR_Close calls te close function of this
> layer with is the function that we are in :)
> layer->dtor and PR_Free(layer) will end up doing the same thing because the
> layer is popped up and the stack connections are cut (lower and higher are
> set to null)

Aha :)

> 
> 
> > @@ +46,5 @@
> > > +  layer->dtor(layer);
> > > +
> > > +  PRStatus status = PR_SUCCESS;
> > > +
> > > +  if (aFd) {
> > 
> > will always be true
> > 
> 
> aFd does not have the same value as at the beginning of the function. Call
> to PR_PopIOLayer has set aFd to the next layer in the chain.
> aFd should be true but I have added this check just to be sure. 

OK, leave it as is.

> 
> 
> > @@ +60,5 @@
> > > +      }
> > > +    } else {
> > > +      // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> > > +      // we are not going to leak anything, but the PR_Close will not be called.
> > > +      PR_Free(aFd);
> > 
> > shouldn't you use dtor() here?  I'm no expert (it's been too long since the
> > last time I played with this stuff) so rather double check.  but that would
> > make more sense to me.  same in the shutdown case in ThreadFunc
> > 
> 
> the only difference between PR_Free and dtor is that dtor cuts connection in
> the layer chain (sets lower and higher to null), but in this case this does
> not matter which one we are using.

If that is the only difference then I would pref calling dtor here.  It may also release any secret structs PR_Free has no chance to ever release.



I'll do the review of the new version now.
Comment on attachment 8657801 [details] [diff] [review]
bug_1152046_prclose_v2.patch

Review of attachment 8657801 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/ClosingService.cpp
@@ +6,5 @@
> +
> +#include "ClosingService.h"
> +#include "nsIOService.h"
> +
> +using namespace mozilla::net;

instead of a using-directive, put all of this into the mozilla::net namespace.

@@ +12,5 @@
> +static PRIOMethods sTcpUdpPRCloseLayerMethods;
> +static PRIOMethods *sTcpUdpPRCloseLayerMethodsPtr = nullptr;
> +static PRDescIdentity sTcpUdpPRCloseLayerId;
> +
> +class ClosingLayerSecret

this must be in an anonymous namespace

@@ +38,5 @@
> +
> +  PRFileDesc* layer = PR_PopIOLayer(aFd, PR_TOP_IO_LAYER);
> +  MOZ_RELEASE_ASSERT(layer &&
> +                     layer->identity == sTcpUdpPRCloseLayerId,
> +                     "Closing Layer not on top of stack");

aha, it's here!  good.

@@ +51,5 @@
> +    // let it leak.
> +    if (gIOService->IsShutdown()) {
> +      // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> +      // we are not going to leak anything, but the PR_Close will not be called.
> +      PR_Free(aFd);

I'd prefer using dtor here, but if there is a problem with that, leave PR_Free

@@ +57,5 @@
> +      closingLayerSecret->mClosingService->PostRequest(aFd);
> +    } else {
> +      // Socket is created before closing service has been started or there was
> +      // a problem with starting it.
> +      status = aFd->methods->close(aFd);

PR_Close(aFd)?

@@ +73,5 @@
> +ClosingService::ClosingService()
> +  : mShutdown(false)
> +  , mMonitor("ClosingService.mMonitor")
> +{
> +  NS_ASSERTION(sInstance==nullptr,

MOZ_ASSERT(!sInstance) - preferred - or sInstance == nullptr

@@ +87,5 @@
> +    PR_ASSERT(PR_INVALID_IO_LAYER != sTcpUdpPRCloseLayerId);
> +
> +    sTcpUdpPRCloseLayerMethods = *PR_GetDefaultIOMethods();
> +    sTcpUdpPRCloseLayerMethods.close = TcpUdpPRCloseLayerClose;
> +    sTcpUdpPRCloseLayerMethodsPtr = &sTcpUdpPRCloseLayerMethods;

nit: why is sTcpUdpPRCloseLayerMethodsPtr needed?

@@ +91,5 @@
> +    sTcpUdpPRCloseLayerMethodsPtr = &sTcpUdpPRCloseLayerMethods;
> +  }
> +
> +  if (!sInstance) {
> +    ClosingService* service = new ClosingService();

it was better in the previous version (to use nsRefPtr here)

@@ +128,5 @@
> +  layer = PR_CreateIOLayerStub(sTcpUdpPRCloseLayerId,
> +                               sTcpUdpPRCloseLayerMethodsPtr);
> +
> +  if (!layer) {
> +    return NS_ERROR_FAILURE;

this may IMO not be fatal.

@@ +139,5 @@
> +
> +  if (status == PR_FAILURE) {
> +    delete secret;
> +    PR_DELETE(layer);
> +    return NS_ERROR_FAILURE;

I would not make this fatal either... but it depends. Your call.

@@ +156,5 @@
> +    // time. To prevent shutdown crash (bug 1152046) do not accept sockets any
> +    // more.
> +    // If the ClosingService layer is the first layer above PR_NSPR_IO_LAYER
> +    // we are not going to leak anything, but PR_Close will not be called.
> +    PR_Free(aFd);

dtor here as well?

@@ +176,5 @@
> +  if (sInstance) {
> +    sInstance->ShutdownInternal();
> +    ClosingService* service = sInstance;
> +    sInstance = nullptr;
> +    NS_RELEASE(service);

I think you can just NS_RELEASE(sInstance).  It will nullify sInstance for you.

::: netwerk/base/nsSocketTransportService2.cpp
@@ +555,5 @@
>          obsSvc->AddObserver(this, "last-pb-context-exited", false);
>      }
>  
> +#if !defined(MOZILLA_XPCOMRT_API)
> +    // Start the closing service. Actual PRClose() will be carried out on

PR_Close()
Attachment #8657801 - Flags: review+
Flags: needinfo?(honzab.moz)
->dtor does not work; it must be PR_Free
Attachment #8657801 - Attachment is obsolete: true
Attachment #8659673 - Flags: review+
Keywords: checkin-needed
hi, the first patch didn't apply cleanly:

renamed 1152046 -> bug_1152046_v1.patch
applying bug_1152046_v1.patch
patching file netwerk/base/nsIOService.h
Hunk #1 FAILED at 72
1 out of 1 hunks FAILED -- saving rejects to file netwerk/base/nsIOService.h.rej
patching file netwerk/base/nsSocketTransport2.cpp
Hunk #1 FAILED at 1211
1 out of 1 hunks FAILED -- saving rejects to file netwerk/base/nsSocketTransport2.cpp.rej
patch failed, unable to continue (try -v)
patch failed, rejects left in working directory

is this checkin-needed request for all patches?
Flags: needinfo?(dd.mozilla)
Keywords: checkin-needed
Comment on attachment 8659673 [details] [diff] [review]
bug_1152046_prclose_v2.patch

Only for patch bug_1152046_prclose_v2.patch.

Thanks.
Flags: needinfo?(dd.mozilla)
Attachment #8659673 - Flags: checkin+
Keywords: checkin-needed
backed out comment 63 due to crashes in 1205016
Attachment #8659673 - Flags: checkin+
Something that landed on top of the original landing isn't playing well with the backout, so I've had to back out the backout. We were seeing b2g debug shutdown hangs.
This bug is causing a performance regression on b2g. We need a forward fix in bug 1204837 with this patch on m-c.

Since this bug is specific to Windows, will the fix be limited to Windows in the next revision? Or will we still have this on all platforms even though neither Linux nor MacOS has this problem?
Flags: needinfo?(dd.mozilla)
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #69)
> This bug is causing performance regression on b2g. We need a forward fix in
> bug 1204837 with this patch on m-c.
> 
> Since this bug is specific to Windows, will the fix be limited to Windows in
> the next revision? Or we'll still have this on all platforms even neither
> Linux nor MacOS has this problem?

Sometimes PR_Close blocks for a long time; it is rare, but it still happens. The blocking times on Windows are longer than on the other OSes (Linux, OS X), but long blocking times happen on all of them. So we will keep this change for all OSes.

Sorry I did not know about bug 1204837.
Flags: needinfo?(dd.mozilla)
I moved ClosingService starting/stopping into nsIOService to be sure that it happens only once.
I think this is not going to solve the crash, but I want to make this change.

Honza, may I ask you for a review, please.
Flags: needinfo?(honzab.moz)
So this landed, and broke Linux stuff.
This was reverted, and broke b2g debug stuff.
That revert was then reverted (so back to this landed), so b2g debug stuff is working, but Linux stuff is broken again.


So what's the plan here? We can't leave both Linux stuff or b2g debug stuff broken like this.
Flags: needinfo?(mcmanus)
Flags: needinfo?(dd.mozilla)
dragana is actively working on it. no idea why the b2g stuff came to depend on it in 24 hrs.
Flags: needinfo?(mcmanus)
Note: I backed out bug 1152046 (again) locally, since I was insta-crashing with it in.  That lets me run, but testing https://mozilla.github.com/webrtc-landing/pc_test.html crashes in the same place (jemalloc/internal/bitmap.h) when in e10s - but not in a non-e10s profile.  So it's very touchy and machine/perf/allocation/? dependent.

This also implies to me the actual bug is elsewhere (jemalloc?)
I am working on the Linux problem.

I do not know why b2g is broken. I checked: the files that this patch changes have not been modified since the patch landed.

This patch does not call PR_Close after we enter shutdown; it just cleans up but does not close the socket. b2g hits a shutdown hang; in any case, maybe we should figure out why it hangs when this patch gets backed out.

Nigel, do you know what has changed in b2g, probably something related to sockets?
Flags: needinfo?(dd.mozilla) → needinfo?(nigelbabu)
I have no idea. I'm probably not the best person to answer this question.
Flags: needinfo?(nigelbabu)
No longer blocks: 1001691
Attachment #8664191 - Flags: review?(mcmanus)
Attachment #8664191 - Flags: review?(mcmanus) → review+
Attachment #8664191 - Flags: checkin+
Keywords: checkin-needed
checkin only for patch bug_1152046_disable_on_ffos.patch
fix bustage. Sorry.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a34a9dbaf39
Attachment #8664191 - Attachment is obsolete: true
Flags: needinfo?(dd.mozilla)
Attachment #8664957 - Flags: review+
Attachment #8664957 - Flags: checkin+
checkin needed only for patch  bug_1152046_disable_on_ffos.patch
Keywords: checkin-needed
sorry had to back this out again for another bustage, this time b2g https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-inbound
Flags: needinfo?(dd.mozilla)
(In reply to Carsten Book [:Tomcat] from comment #87)
> sorry had to back this out again for another bustage, this time b2g
> https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> inbound

This looks like a shutdown deadlock on the CompositorThread as in https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #89)
> (In reply to Carsten Book [:Tomcat] from comment #87)
> > sorry had to back this out again for another bustage, this time b2g
> > https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> > inbound
> 
> This looks like a shutdown deadlock on the CompositorThread as in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.


Patch bug_1152046_disable_on_ffos.patch only disables use of the ClosingService thread on b2g (until bug 1204837).

I think there is another bug, a deadlock, and enabling/disabling the ClosingService only exposes it.
Currently all calls to PR_Close are dispatched to the ClosingService thread, and patch bug_1152046_disable_on_ffos.patch disables the ClosingService so PR_Close is called on the socket thread (as it used to be before bug_1152046_prclose_v2.patch landed, around September 15th). It is probably a timing issue in that other bug which this patch exposes. I do not see any other connection between the CompositorThread and this bug.
What do you think?
Flags: needinfo?(dd.mozilla) → needinfo?(cyu)
(In reply to Dragana Damjanovic [:dragana] from comment #90)
> 
> Patch bug_1152046_disable_on_ffos.patch only disables use of ClosingService
> thread on b2g (until bug 1204837). 
> 
> I thing there is another bug which is a deadlock and this enabling/disabling
> ClosingService only exposes it.
> Currently all calls to PR_Close are dispatch to a ClosingService thread. and
> patch bug_1152046_disable_on_ffos.patch disables ClosingService so PR_Close
> are called on SocketThread (this use to be before 
> bug_1152046_prclose_v2.patch landed, around 15.September). It is probably a
> timing issue that this patch exposes another bug. I do not see any other
> connection between CompositorThread and this bug.
> What do you think?

Exactly. Some patch introduced this regression after ClosingService landed, and it was hidden. Now it's revealed. I don't think ClosingService has anything to do with the deadlock on mochitest shutdown.
Flags: needinfo?(cyu)
(In reply to Dragana Damjanovic [:dragana] from comment #90)
> (In reply to Cervantes Yu [:cyu] [:cervantes] from comment #89)
> > (In reply to Carsten Book [:Tomcat] from comment #87)
> > > sorry had to back this out again for another bustage, this time b2g
> > > https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> > > inbound
> > 
> > This looks like a shutdown deadlock on the CompositorThread as in
> > https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.
> 
> 
> Patch bug_1152046_disable_on_ffos.patch only disables use of ClosingService
> thread on b2g (until bug 1204837). 
> 
> I thing there is another bug which is a deadlock and this enabling/disabling
> ClosingService only exposes it.
> Currently all calls to PR_Close are dispatch to a ClosingService thread. and
> patch bug_1152046_disable_on_ffos.patch disables ClosingService so PR_Close
> are called on SocketThread (this use to be before 
> bug_1152046_prclose_v2.patch landed, around 15.September). It is probably a
> timing issue that this patch exposes another bug. I do not see any other
> connection between CompositorThread and this bug.
> What do you think?

If you are looking for a quick fix, just go with disabling the code on the specific platforms where it causes issues. We can investigate later what's actually wrong.

Personally I think your service shutdown code and some of my 'concurrent access' concerns might be worth checking on.
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #92)
> (In reply to Dragana Damjanovic [:dragana] from comment #90)
> > (In reply to Cervantes Yu [:cyu] [:cervantes] from comment #89)
> > > (In reply to Carsten Book [:Tomcat] from comment #87)
> > > > sorry had to back this out again for another bustage, this time b2g
> > > > https://treeherder.mozilla.org/logviewer.html#?job_id=14625952&repo=mozilla-
> > > > inbound
> > > 
> > > This looks like a shutdown deadlock on the CompositorThread as in
> > > https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14.
> > 
> > 
> > Patch bug_1152046_disable_on_ffos.patch only disables use of ClosingService
> > thread on b2g (until bug 1204837). 
> > 
> > I thing there is another bug which is a deadlock and this enabling/disabling
> > ClosingService only exposes it.
> > Currently all calls to PR_Close are dispatch to a ClosingService thread. and
> > patch bug_1152046_disable_on_ffos.patch disables ClosingService so PR_Close
> > are called on SocketThread (this use to be before 
> > bug_1152046_prclose_v2.patch landed, around 15.September). It is probably a
> > timing issue that this patch exposes another bug. I do not see any other
> > connection between CompositorThread and this bug.
> > What do you think?
> 
> If you are looking for a quick fix, just go with disabling the code
> specifically per platforms it causes issues on.  We can investigate later
> what's actually wrong.
> 
> Personally I think your service shutdown code and some of my 'concurrent
> access' concerns might be worth checking on.

I needinfo'd you for a review of the patch bug_1152046_move_CLosingService_start_toioservice.patch


and the issue you responded to got complications: see bugs 1207979 and 1208019
Flags: needinfo?(honzab.moz)
Please note bug 1204837 lands a fix for ClosingService on b2g so we don't need to disable it on b2g anymore.
Dragana, it wasn't clear what you needed.

I'll review the patch.

I don't follow what you need from me on "the issues".

- according comment 94 it seems we no longer need bug 1208019, right?
- the same may apply to bug 1207979, or not?

Please state more clearly how I can help here.
Flags: needinfo?(honzab.moz) → needinfo?(dd.mozilla)
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #95)
> Dragana, it wasn't clear what you needed.
> 
> I'll review the patch.
> 
> I don't follow what you need from me on "the issues".
> 
> - according comment 94 it seems we no longer need bug 1208019, right?
> - the same may apply to bug 1207979, or not?
> 
> Please state more clearly how I can help here.

Yes, we do not need them, but there is another bug that I do not want to be missed, because it is going to show up at some point.
Explanation:

Patch bug_1152046_prclose_v2.patch (the implementation of ClosingService) was pushed on September 14th. On the same day, a couple of hours later, they tried to back it out, but it did not work because of:
https://bugzilla.mozilla.org/show_bug.cgi?id=1204837#c14

The thing is that a patch from bug 1127270 was pushed on the same day. If you disable the ClosingService or back it out, that patch causes problems; if the ClosingService is used, everything is fine. This bug and bug 1127270 do not have anything in common, so I believe there is a bug in the patch from bug 1127270 that shows up only if the ClosingService is not used.

So we do not need bug 1208019, but I want someone to look at bug 1127270 (ImageBridgeParent) and find an explanation for why it fails when the ClosingService is not used.
Flags: needinfo?(dd.mozilla)
Comment on attachment 8661965 [details] [diff] [review]
bug_1152046_move_CLosingService_start_toioservice.patch

Review of attachment 8661965 [details] [diff] [review]:
-----------------------------------------------------------------

Hmm.. are you sure that nsSocketTransportService::Init is not called more than once?

If there are continuing issues around this code I can take a look deeper next week I'm back at the office.

::: netwerk/base/ClosingService.cpp
@@ +184,5 @@
>  {
>    {
>      mozilla::MonitorAutoLock mon(mMonitor);
> +    if (mShutdown) {
> +      // This should not happened.

"This should never happen."

OTOH, I think it may and having a check like this is good.

Worth adding a similar "only-once" check to the start method?

::: netwerk/base/nsIOService.cpp
@@ +1064,5 @@
>              NS_ASSERTION(NS_SUCCEEDED(rv), "socket transport service shutdown failed");
>          }
> +        if (mShutdown) {
> +            ClosingService::Shutdown();
> +        }

maybe rather put this to the NS_XPCOM_SHUTDOWN_OBSERVER_ID topic handling code directly?

if there is some reason not to (and have it here rather) please add a comment.
Attachment #8661965 - Flags: review+
(In reply to Honza Bambas (not reviewing) (:mayhemer) from comment #97)
> Comment on attachment 8661965 [details] [diff] [review]
> bug_1152046_move_CLosingService_start_toioservice.patch
> 
> Review of attachment 8661965 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Hmm.. are you sure that nsSocketTransportService::Init is not called just
> once?

I noticed that nsSocketTransportService::Shutdown can be called more than once; it is called on profile change as well, and that is a problem :). (Init is called just once.)


> 
> ::: netwerk/base/nsIOService.cpp
> @@ +1064,5 @@
> >              NS_ASSERTION(NS_SUCCEEDED(rv), "socket transport service shutdown failed");
> >          }
> > +        if (mShutdown) {
> > +            ClosingService::Shutdown();
> > +        }
> 
> maybe rather put this to the NS_XPCOM_SHUTDOWN_OBSERVER_ID topic handling
> code directly?
> 
I have put it there, after the SocketTransportService shutdown, to let more PR_Close calls run, because most of the time they are not the ones causing the problem.
check in only for 

bug_1152046_move_CLosingService_start_toioservice.patch
Keywords: checkin-needed
Attachment #8664957 - Flags: checkin+
Dragana, do you have plans wrt this bug for 42? Thanks
Flags: needinfo?(dd.mozilla)
(In reply to Sylvestre Ledru [:sylvestre] from comment #103)
> Dragana, do you have plans wrt to this bug for 42? Thanks

I was not planning to uplift the last two patches, although there has been no problem with the patch (since September 17th or so).
Flags: needinfo?(dd.mozilla)
OK, I will mark 42 & 43 as wontfix then. Thanks
I was looking at some crash logs again:
I have not seen any PR_Close hangs.
I have seen PR_Connect hangs, even though we check gIOService->IsShutdown() before trying to connect, so it hangs there for a long time.
There is also an nsAutodial hang (we also do not try to reconnect in case of shutdown).
And there are some poll hangs.

Things that can have an influence on these hangs:
avcuf32.dll
rlls.dll
winhadnt.dll
(but not all reports have one of these)

Win sockets are made nonblocking using rv = _md_MakeNonblock((HANDLE)osfd); just before calling connect():
http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntio.c#1160
I could check whether we set the nonblocking flag correctly, but it is hard to check whether it is really set on the OS socket.

I have also read some of the comments people left: there were 2-3 saying that Firefox is really slow. I could assume Firefox was slow, so they tried to shut it down and this crash happened; probably there was already a problem before they clicked shutdown.
See Also: → 1215970
Depends on: 1233237
(In reply to Dragana Damjanovic [:dragana] from comment #106)
> win sockets are made nonoblocking using rv = _md_MakeNonblock((HANDLE)osfd);
> just before calling connect(); 
> http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntio.
> c#1160
> I could check whether we set nonblocking flag correctly but it is hard to
> check if it is really set to the os socket.

Heads up: ntio.c isn't used by our codebase, as NSPR is configured with the WIN95 configuration. See https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/NSPR_build_instructions#--enable-win32-target.3DWIN95

It looks like the WIN95 configuration does not support this at all.
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #107)
> (In reply to Dragana Damjanovic [:dragana] from comment #106)
> > win sockets are made nonoblocking using rv = _md_MakeNonblock((HANDLE)osfd);
> > just before calling connect(); 
> > http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/ntio.
> > c#1160
> > I could check whether we set nonblocking flag correctly but it is hard to
> > check if it is really set to the os socket.
> 
> Heads up: ntio.c isn't used by our codebase, as NSPR is configured with the
> WIN95 configuration. See
> https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/
> NSPR_build_instructions#--enable-win32-target.3DWIN95
> 
> It looks like the WIN95 configuration does not support this at all.

We do set PR_SockOpt_Nonblocking = true on the socket in nsSocketTransport::InitiateSocket(), but looking into how that is handled, I cannot find that we actually set any option directly on the system socket. The only thing we do is set the secret->nonblocking = true flag on our FD in prmapopt.c!_PR_SocketSetSocketOption. This flag, however, doesn't lead to any other setting being made on the socket (w95sock.c).

Dragana, we could have a private patch for NSPR that copies that particular piece of code from ntio.c to w95sock.c, to try. Submitting it directly to NSPR should not be that hard if it is found to work.
This is a short update.

w95sock.c always makes sockets nonblocking:
http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/w95sock.c#63

and blocking mode is emulated inside NSPR: for each read/write on a blocking socket, it loops until the operation completes.

I will dig into this further.
(In reply to Dragana Damjanovic [:dragana] from comment #109)
> This is short update.
> 
> w95sock.c always makes sockets nonblocking:
> http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/md/windows/
> w95sock.c#63

I knew it was somewhere!  Thanks Dragana.
Depends on: 1238910
Depends on: 1239655
(In reply to Dragana Damjanovic [:dragana] from comment #111)
> I notice couple of crashes on PR_ConnectContinue:

In these three cases it seems to have stopped here: http://hg.mozilla.org/releases/mozilla-beta/annotate/7ccdff6c2b81/nsprpub/pr/src/io/prsocket.c#l294 .. which is a getsockopt() call on the actual underlying socket descriptor.
(In reply to Daniel Stenberg [:bagder] from comment #112)
> (In reply to Dragana Damjanovic [:dragana] from comment #111)
> > I notice couple of crashes on PR_ConnectContinue:
> 
> In these three cases it seems to have stopped here:
> http://hg.mozilla.org/releases/mozilla-beta/annotate/7ccdff6c2b81/nsprpub/pr/
> src/io/prsocket.c#l294 .. which is a getsockopt() call on the actual
> underlying socket descriptor.

I know, but when we are in shutdown we can just make it fail here.
I noticed that we are doing some reads on the socket when we are closing an nsHttpConnection:
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnection.cpp#599

If we need to reduce the number of operations we do when we are in shutdown, we can skip this. I have not seen any crash on this, but I am writing it down here just to keep it in mind in case we need it.
Depends on: 1239961
Depends on: 1240269
Depends on: 1240481
Depends on: 1241295
Whiteboard: [necko-active]
I am going to close this bug. Bug 1158189 is used as the meta bug. For simplicity, we are going to open separate bugs for similar issues, since this one has become flooded.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
You need to log in before you can comment on or make changes to this bug.