Closed Bug 1297354 Opened 8 years ago Closed 8 years ago

Experiment with WSAEventSelect/WSAWaitForMultipleEvents for socket polling

Categories

(Core :: Networking, defect)

All
Windows 8.1
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox51 --- affected

People

(Reporter: mayhemer, Assigned: mayhemer)

References

Details

Attachments

(1 file, 3 obsolete files)

Attached patch v1 (obsolete) — Splinter Review
So, this is a polling code that uses better polling techniques available in windows 8.1+.  Those don't suffer from blocking inside select() and completely eliminate the pollable event (local socket).

How does it work:
- the use is conditional, building only on windows, engaged only when the API is available
- if available, polling/idle/active arrays in nsSocketTransportService are not used
- instead this all is handled by (forwarded to) the new code
- PollingPool, is the "public API" and singleton, nsSocketTransportService calls on it in most cases like AttachSocket, Run, Poll etc...
- it manages the threads (we need more than 1), and also balances their use with default pool size of 6 threads as a benefit
- PollingThread, is a single thread taking 63 sockets tops, doing the actual select/wait/enum loops 
- the tricky part was to preserve behavior of PR_Poll, which filters all fds in-flags by call to its fd->poll() "method", which may mishmash with the flags in and out, and may also artificially signal an event on the socket
- this is heavily used mainly by ssl sockets
- hence, on every event loop we do this filtering (as I call it) over all sockets (this is no worse than the original code) ; the result of fd->poll() may be influenced by an even unrelated to the socket that distantly changes state of e.g. the ssl machine attached to the socket (like cert verification completion)
- APollerContext/PollerContext is a helper class hanging of off nsASocketHandler
- we need to add some extra data the win-poller needs (because of its async nature) and also because we need to call back from the socket code back to the poller because of missing writability notification of the new win polling API
- the new API does signal writability on a socket just after it's been connected and then after send() == WOULDBLOCK and send buffer is freed again ; hence, we need to track the socket is writable manually via a flag, which is dropped when PR_Send returns WOULDBLOCK (I didn't find a way to track this directly on the TCP socket, but this "hack" seems to work pretty well anyway)

Everything else should either be clear from the comments or the code.


https://treeherder.mozilla.org/#/jobs?repo=try&revision=21b665054e5f
https://treeherder.mozilla.org/perf.html#/comparechooser?newProject=try&newRevision=21b665054e5f
Attachment #8789548 - Flags: review?(mcmanus)
Please note, that few things need yet to be finished, like handing errors returned by the WSA functions, test if timeout on sockets is implemented correctly and maybe few other details.  But I think the patch is otherwise mature enough to ask for r and not just for f.
And as reading the main new header, definitely some of the comments details are outdated.  Hope you won't be too much confused.
Comment on attachment 8789548 [details] [diff] [review]
v1

Review of attachment 8789548 [details] [diff] [review]:
-----------------------------------------------------------------

This is interesting for sure. I'll read it tomorrow. I want dragana to read it too.

thanks!
Attachment #8789548 - Flags: review?(dd.mozilla)
I really do appreciate this effort.. and I think this could drive the hang rate down. of course its not going to help with win7/xp or the connect() hangs so its not a panacea.. but worth trying.
Comment on attachment 8789548 [details] [diff] [review]
v1

Review of attachment 8789548 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/WindowsPolling.cpp
@@ +13,5 @@
> +#include "nsThreadUtils.h"
> +
> +namespace mozilla {
> +namespace net {
> +namespace win {

I don't think we need more namespaces here and would prefer we not draw more divisions within netwerk if we don't need to..

@@ +76,5 @@
> +  Atomic<bool, Relaxed> mSocketWritable;
> +  Atomic<uint32_t, mozilla::ReleaseAcquire> mInFlagsRead, mInFlagsWrite;
> +  TimeStamp mLastEventTime;
> +
> +  virtual void WriteBlocked() override;

a comment here on when this needs to be called

@@ +140,5 @@
> +
> +  // This flag is only used to signal detach synchronously (makes any event
> +  // on the socket scheduled before detach be ignored ; for compat with the
> +  // single thread model only)
> +  context->mDetached = false;

this should be in the ctor

@@ +144,5 @@
> +  context->mDetached = false;
> +  // Assume the socket is not writable until we get the first FD_WRITE notification
> +  // We need this flag because FD_WRITE is notified only the first time and then
> +  // after send() > WOULDBLOCK and send buffer is freed.
> +  context->mSocketWritable = false;

this should be in the ctor

@@ +156,5 @@
> +      return NS_OK;
> +    }
> +  }
> +
> +  RefPtr<PollingThread> poller = NewPoller();

need to null check that one especially

@@ +255,5 @@
> +  MOZ_ASSERT(IsCurrentThread(aPoller));
> +
> +  RefPtr<PollingThread> poller(aPoller);
> +  RefPtr<PollingPool> self(this);
> +  Unused << mSTS->Dispatch(NS_NewRunnableFunction([poller = Move(poller), self = Move(self)]() -> void

oh my :)

@@ +776,5 @@
> +    WSANETWORKEVENTS selected;
> +    rv = mPool->mWSAEnumNetworkEvents(context->mSocket, mEvents[i + 1], &selected);
> +    if (rv == SOCKET_ERROR) {
> +      printf("WSAEnumNetworkEvents: %d\n", WSAGetLastError()); // TODO
> +      MOZ_CRASH("!");

todo :)

@@ +784,5 @@
> +    int16_t outFlags = 0;
> +
> +    if (selected.lNetworkEvents & FD_READ) {
> +      if (selected.iErrorCode[FD_READ_BIT]) {
> +        outFlags |= PR_POLL_EXCEPT; // TODO tune

I don't understand wwhat can be tuned

::: netwerk/base/nsSocketTransportService2.cpp
@@ +242,5 @@
>  nsresult
>  nsSocketTransportService::DetachSocket(SocketContext *listHead, SocketContext *sock)
>  {
> +    // This method is unused when winpoll is used, it detaches sockets its own way
> +#ifdef XP_WIN

you might just define sWinPoll to be void * and null if !XP_WIN, so that these asserts can be less lines and cross platform

@@ +570,5 @@
>  
>      if (mShuttingDown)
>          return NS_ERROR_UNEXPECTED;
>  
> +#ifdef XP_WIN

nspr logging here to tell us which system is being used

@@ +874,5 @@
> +
> +        sWinPoll->PollFilter();
> +
> +        MutexAutoLock lock(mLock);
> +        if (mShuttingDown) {

ah. what could go wrong :) ?

(nspr log)

::: netwerk/base/nsUDPSocket.cpp
@@ +134,5 @@
>    if (count < 0) {
>      PRErrorCode code = PR_GetError();
> +#ifdef XP_WIN
> +    if (code == PR_WOULD_BLOCK_ERROR && mSocket->mPoller) {
> +      mSocket->mPoller->WriteBlocked();

i seem to recall some rtc code that uses sockettransport service but not the stream or udp outputstreams.. that might need an update. ask :drno I think
Attachment #8789548 - Flags: review?(mcmanus) → review+
(In reply to Patrick McManus [:mcmanus] from comment #6)
> Comment on attachment 8789548 [details] [diff] [review]
> v1
> 
> Review of attachment 8789548 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: netwerk/base/WindowsPolling.cpp
> @@ +13,5 @@
> > +#include "nsThreadUtils.h"
> > +
> > +namespace mozilla {
> > +namespace net {
> > +namespace win {
> 
> I don't think we need more namespaces here and would prefer we not draw more
> divisions within netwerk if we don't need to..

I think it's either this (more elegant IMO) or prefix all class/specifics names with the "Win" prefix, do you really want that?

> > +  Unused << mSTS->Dispatch(NS_NewRunnableFunction([poller = Move(poller), self = Move(self)]() -> void
> 
> oh my :)

C++14 :)

> > +#ifdef XP_WIN
> 
> you might just define sWinPoll to be void * and null if !XP_WIN, so that
> these asserts can be less lines and cross platform

Was thinking about it.  Having the #ifdefs all around make more clear "this is windows specific".  But looks ugly...

> > +        MutexAutoLock lock(mLock);
> > +        if (mShuttingDown) {
> 
> ah. what could go wrong :) ?

Not sure what you mean?
(In reply to Patrick McManus [:mcmanus] from comment #6)
> > +      mSocket->mPoller->WriteBlocked();
> 
> i seem to recall some rtc code that uses sockettransport service but not the
> stream or udp outputstreams.. that might need an update. ask :drno I think

https://dxr.mozilla.org/mozilla-central/search?q=ref%3AAttachSocket&redirect=true
What's funny is that there is no way to test this than to simply ship this.  We don't have win8.1 try images, only 8.

I can only check this locally.  I will only produce a build and use it for a while on all my machines to catch any potential problems.
Comment on attachment 8789548 [details] [diff] [review]
v1

Review of attachment 8789548 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #8789548 - Flags: review?(dd.mozilla) → review+
Attached patch v1 -> v1.1 (obsolete) — Splinter Review
Main changes:
- WriteBlocked() checks whether the OS socket really would block or not by calling send("", 0) on it ; this correctly detects the blocking state and doesn't seem to send a zero sized packet (if that is even possible) ; fixes test_tls_server.js
- introduced a code allowing to freeze a polling thread for a particular fd when that fd is being altered or closed on the STS thread ; fixes test_socks.js

Minor changes:
- comments,ctor,constness cleanup
- todo's resolved
- shutdown logic fix (the proper place to stop the poller threads found)
- few more ASocketHandler impls updated to call WriteBlocked appropriately
- about:networking support
- OS socket is no longer cached in the context, since it can be exchanged (socks code does it)

- I left the ::win namespace since I think it's nicer than having Win* prefix on each class ; please speak up if this is something we definitely don't want in the Necko code (I personally think of playing with more platform specific polling code, like EPOLL on linux, and also having more poll threads with reasonable granularity, in general ; hence, namespaces can save some typing)
Attachment #8789548 - Attachment is obsolete: true
Attached patch v1.1 (obsolete) — Splinter Review
There are more changes, so I think it needs a seconds review round.  Please see the interdiff.

I'm going to experiment with the patch directly:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee4809f65d0a
Attachment #8791945 - Flags: review?(mcmanus)
Attachment #8791945 - Flags: review?(dd.mozilla)
Comment on attachment 8791945 [details] [diff] [review]
v1.1

Review of attachment 8791945 [details] [diff] [review]:
-----------------------------------------------------------------

just in time for a new nightly!
Attachment #8791945 - Flags: review?(mcmanus) → review+
Attachment #8791945 - Flags: review?(dd.mozilla)
Attachment #8791945 - Flags: review+
Thanks for the quick reviews!

Note that there is no test coverage for this feature by our build infra.  This is Windows 8.1 only feature.  I'm testing it on Win10 locally for few days, no problem at all found so far.

The try build only checks this builds correctly on all platforms:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2e52d09b4a0e
Keywords: checkin-needed
Keywords: checkin-needed
(In reply to Honza Bambas (:mayhemer) from comment #14)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=2e52d09b4a0e

There is a very very weird crash happening:
https://treeherder.mozilla.org/logviewer.html#?job_id=27729856&repo=try#L35703

Needs investigation first.  It might be that mThread must be accessed only under the lock, still the crash seems to me impossible from simply looking at the code.

And btw, apparently there is some coverage on our infra since the buildbot is server 2008 with support of this new api!
> 
> And btw, apparently there is some coverage on our infra since the buildbot
> is server 2008 with support of this new api!


:) !
Fixing the leak, still seems like some tests are timing out:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=31878d58d886
dragana were just chatting and we recalled some of the LSP complications we had when closing sockets off the socket thread.. I think we need to be aware of that with this select-on-thread-pool strategy as well.. unfortunately I don't have any ideas on how to test it - more something to be aware of as a 3rd party complication.
(In reply to Patrick McManus [:mcmanus] from comment #18)
> dragana were just chatting and we recalled some of the LSP complications we
> had when closing sockets off the socket thread.. I think we need to be aware
> of that with this select-on-thread-pool strategy as well.. unfortunately I
> don't have any ideas on how to test it - more something to be aware of as a
> 3rd party complication.

Closing of sockets is still happening on the STS thread.  As well as reads and writes and any calls to IO functions on the socket, except the new polling APIs.

We'll see.
(In reply to Honza Bambas (:mayhemer) from comment #15)
> (In reply to Honza Bambas (:mayhemer) from comment #14)
> > https://treeherder.mozilla.org/#/jobs?repo=try&revision=2e52d09b4a0e
> 
> There is a very very weird crash happening:
> https://treeherder.mozilla.org/logviewer.
> html#?job_id=27729856&repo=try#L35703
> 
> Needs investigation first.  It might be that mThread must be accessed only
> under the lock, still the crash seems to me impossible from simply looking
> at the code.
> 
> And btw, apparently there is some coverage on our infra since the buildbot
> is server 2008 with support of this new api!

We simply assign to mThread too late, whole thread start (NS_NewNamedThread("Socket Thread")) must happen under the lock.
Depends on: 1305025
OK, removing the dependency again.  After discussion with Dragana I reintroduced the event processing limit to only process events dispatched up until the point we got an event (there is no notion of poll leave in the multithreaded architecture, and it's the closest approximation).

If we process all pending events in an endless while(pending), we may end up looping indefinitely because of bug 1305025 and never update polling flags by a call to PollFilter() - we never reach it.  OTOH, processing only a single event between PollFilter()'ing is 1) sub-optimal 2) breaks netwerk/test/unit/test_range_requests.js test.

The new code only waits (blocks) until an event is dispatched.  It's then processed and wakes the STS thread.  If that event schedules anything, it will be processed in a loop, but only that.  Then we do PollFilter()'ing again.
No longer depends on: 1305025
Attached patch v1.2Splinter Review
- fixes a shutdown leak (just release of all references to mSTS and all thread to break cycles)
- handler's mCondition checked before we check if the handler is polling for anything (led to miss of propagation of an error state back to the socket handler when it was not actively polling for anything)
- NS_NewNamedThread("Socket Thread") made under the lock to not race on mThread read on the sts thread
- process only recent events on STS between calls to PollFilter (which updates poll flags and signal detach on condition failure, see previous comment why)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2cf6954b93ad (on top of the pipe patch)
Attachment #8791944 - Attachment is obsolete: true
Attachment #8791945 - Attachment is obsolete: true
Attachment #8794236 - Flags: review+
Comment on attachment 8794236 [details] [diff] [review]
v1.2

Tests still failing...
Attachment #8794236 - Flags: review+
The reason tests are timing out is that there is one more "special" behavior of this API:  

When you poll a socket for reading, you always get the notification.  But, when your reaction to that notification is not to read from the socket, next time you ask if there are some data to read the API will remain quiet and never notify again - despite there are data pending in the socket's read buffer!

This can happen when a consumer calls socket input stream's asyncWait() later than the data has been captured on the socket.  The notification well be forgotten (bypassed) on the condition at [1].  select() always notifies when there are data to read, so the problem doesn't manifest.

There is a way to overcome this simply by having "I've read something from the socket" method on the context object that will notify after every successful read.  More places to put this at, more complications, more sensitive to mistakes...

Probably a better way would be to add my own FD layer that would filter usage of the TCP socket, capture both would_block on writes and successful reads.

I haven't decided if I want to go that way, also because I never know what other surprise I can discover next...

[1] https://dxr.mozilla.org/mozilla-central/rev/c55bcb7c777ea09431b4d16903ed079ae5632648/netwerk/base/nsSocketTransport2.cpp#1964
OK, the problem is 'slightly' more complicated:

- we poll a socket for reading
- it's singled
- we drop the FD_READ in flag in the poller's internal flag set
- we dispatch OnSocketReady with outflags = 1 to the STS thread
- but before it's processed on the STS thread, we process more events what causes us to PollFilter for the handler having a notification pending "in the fly"
- that handler didn't receive the OnSocketReady call yet, hence it still has the poll flag set for READ
- the call to PollFilter resets READ in-flag of the poller's internal set
- later, OnSocketReady with outflags = 1 is called
- this drops the in-flags to ~READ
- but (bug#1) this is not reflected in the poller's own in-flags (READ is there left)
- when some new data arrive sooner on the socket than asyncWait on the input stream is called again, we fire the notification from the poller thread to nowhere: no body expects to get the notification (this is bug#2)

If we fix bug#1 - reset poller's poll flags to 0 after OnSocketReady, we still may get the "deadlock" situation with getting READ notification when not expecting it in socket transport.  Actually, from how the code works for select(), it must never happen we get READ notification when there is no listener for it and in poll flag is not set in the socket transport.

Hence, the solution is far more complicated if there is an async thread polling thread and OnReady events are queued.

This then opens a question if we should start experimenting with more than one STS thread, something I'd like to do one day anyway.
Whiteboard: [necko-active] → [necko-next]
it sounds like you're describing the new api as edge triggered whereas the traditional poll() api is level triggered.

I agree that's going to be a huge mismatch - we have lots of code that might do a short read.. you can turn the edge into level if you have a peek() mechanism that works reliably.
The API is too different to be integrated to an application based on behavior of select().  Too much work and risk for not that big benefit.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Whiteboard: [necko-next]
Duplicate of this bug: 1850164
No longer duplicate of this bug: 1850164
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: