SimplePush: Add "redirect" response to "hello" websocket packet

RESOLVED WONTFIX

Status


Core
DOM: Push Notifications
RESOLVED WONTFIX
5 years ago
2 years ago

People

(Reporter: jrconlin, Assigned: lina)

Tracking

unspecified
Future
x86_64
Windows 7
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

5 years ago
In order to allow for sharding and other server side features, the protocol needs the ability to send users to a specific resource. 

Background:
The WebSocket protocol does not allow for server-side redirects. Any attempt by the server to return a 30x response to a websocket client results in the websocket handler failing the connection and aborting. The usual suggestion is that both parties first negotiate over HTTP to determine which websocket connection to use; however, because we wish to keep the number of potential connections from a mobile device to a minimum, SimplePush was originally designed to go to a single WebSocket entry point first.

Suggested Change:
A method to resolve this issue without breaking existing clients is to extend the "hello" response to include a "redirect" field.

e.g. 
client:
{"messageType": "hello", "uaid":"abc123"}
server response:
{"messageType": "hello", "uaid":"abc123", "status": 302, "redirect": "wss://..."}

Upon receipt of the "status": 302 message, the client disconnects from the initial server and reconnects to the specified "redirect" URL.

A client should safeguard against excessive redirecting, and should terminate after 5 redirection requests. 

If a client receives a redirect request, and takes no action, the server cannot be considered reliable and the client may not receive proper notifications. In addition, registration requests may fail.
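
For concreteness, here is a minimal client-side sketch of the behavior described above (illustrative only; this is not the actual PushService patch, and the helper names are made up):

const DEFAULT_SERVER = "wss://push.services.mozilla.com";
const MAX_REDIRECTS = 5;

let redirectsFollowed = 0;

function handleHelloResponse(msg, reconnect) {
  if (msg.messageType !== "hello") {
    return;
  }
  if (msg.status === 302 && msg.redirect) {
    if (redirectsFollowed >= MAX_REDIRECTS) {
      // Too many hops: give up on the redirect chain and fall back to the
      // default entry point.
      redirectsFollowed = 0;
      reconnect(DEFAULT_SERVER);
      return;
    }
    redirectsFollowed++;
    reconnect(msg.redirect); // drop the current socket, dial the new host
    return;
  }
  // Normal "hello" ack: the session is settled, reset the hop counter.
  redirectsFollowed = 0;
}

// Example: a 302 "hello" triggers a reconnect to the advertised host.
handleHelloResponse(
  { messageType: "hello", uaid: "abc123", status: 302, redirect: "wss://push2.example.com" },
  url => console.log("reconnecting to", url)
);
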
Assignee: nobody → nsm.nikhil
(Reporter)

Comment 1

5 years ago
Just to be clear, after 5 redirection requests, the client should go back to wss://push.services.mozilla.com. 

In addition, the server team or Ops should monitor whether a UAID that has been redirected to a different host keeps returning to push.services.mozilla.com within a short period of time. That may indicate a distribution error or a faulty server.
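
A rough sketch of what that monitoring could look like (hypothetical server-side code, not part of any patch; the five-minute window is an assumption standing in for "a short period of time"):

const BOUNCE_WINDOW_MS = 5 * 60 * 1000; // the "short period" is an assumption
const redirectedAt = new Map();         // uaid -> timestamp of the last redirect we issued

function recordRedirect(uaid) {
  redirectedAt.set(uaid, Date.now());
}

function onHelloAtDefaultHost(uaid) {
  const ts = redirectedAt.get(uaid);
  if (ts !== undefined && Date.now() - ts < BOUNCE_WINDOW_MS) {
    console.warn("UAID " + uaid + " bounced back to the default host; " +
                 "possible distribution error or faulty target server");
  }
}
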
Created attachment 783857 [details] [diff] [review]
Add "redirect" support to SimplePush WebSocket.
Created attachment 783861 [details] [diff] [review]
Add "redirect" support to SimplePush WebSocket.
Comment on attachment 783861 [details] [diff] [review]
Add "redirect" support to SimplePush WebSocket.

A big chunk moves code around for readability, but it has a line or two of redirect related code (_redirectsFollowed + comment, calling resetServerURL() in init()).
Attachment #783861 - Flags: feedback?(jrconlin)
(Reporter)

Comment 5

5 years ago
Comment on attachment 783861 [details] [diff] [review]
Add "redirect" support to SimplePush WebSocket.

Review of attachment 783861 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #783861 - Flags: feedback?(jrconlin) → feedback+
Attachment #783861 - Flags: review?(justin.lebar+bug)
Sorry if I'm being dumb here, but I don't see why we need this, nor why it helps with the problem as identified.

I don't get why we need this because why can't we use TCP-level load-balancing?  This would work just like a NAT.  The hardware load-balancer keeps track of incoming TCP connections and maps them to a group of servers.

But suppose load-balancing isn't possible for some reason, and any request to push.mozilla.org must connect to exactly one server.  If so, I don't understand why this solves the scaling problem as described.  In this case all clients will have to connect to one push server in order to get their |redirect| packet.  That push server's job won't be particularly lightweight; it's going to have to set up a TLS connection with the client and so on.  So I don't think this system scales much better than the current one.

bsmith, you know more about networking than I do, so please tell me if I'm out of my mind.
(Reporter)

Comment 7

5 years ago
Customers (or we ourselves, if we're using hosted VPC environments like AWS) do not get direct control over any load balancer the provider may offer. Because of that, TCP-level load balancing may simply not be possible. On the plus side, such load balancers offer TLS termination, which greatly simplifies operational upkeep.

Since WebSocket drops the connection at any attempt at a traditional redirect, this is an attempt to stay within the protocol and the limitations of the hosting environment while preventing server overload.
But if you don't have TCP-level load-balancing, then you still have the problem of all new clients going through one bottleneck server, right?  I have difficulty believing that this is good for scaling.

I'm not trying to have stop energy here, and I understand how this is an improvement, but lots of little incremental improvements are expensive in a protocol.  If this is a problem, I think it's worth considering how to fix it for real.
It's also not clear to me why http://aws.amazon.com/elasticloadbalancing/ doesn't work if you're on AWS, but I didn't read too closely.
(Reporter)

Comment 10

5 years ago
ELB works great on AWS and it's what we use. It also does round-robin assignment to available machines, with the option to spin up additional machines. It's very simple because it's designed for general use: the vast majority of AWS sites are REST-like, using short-term HTTP connections, or reasonably trivial apps that have only a few thousand connections.

In short, AWS REALLY doesn't like long term connections, and makes things rather difficult for sites that try to do this. (Netflix created their own load balancers in order to deal with some of those same issues. see "Eureka" if you're interested in one of them.) 

It would be delightful not to have to require this. Personally, I would have preferred that the protocol perform an HTTP request first to get the wss host to connect to, but that ship has long since sailed. So this is the next option.
> (Netflix created their own load balancers in order to deal with some of those same issues. see 
> "Eureka" if you're interested in one of them.) 

And it's even open-source.  Did we consider using that?

> So this is the next option.

I still don't feel like my point that this is at best an incremental change which doesn't actually let us scale is being addressed.

If we make this change, we're stuck with it for a long time.  I'm asking that we do diligence to ensure that

a) This change actually fixes the problem at hand, and
b) This is the best solution we can come up with.

I'm not convinced that either of these is true.

> In short, AWS REALLY doesn't like long term connections, and makes things rather difficult for 
> sites that try to do this.

This is a tangent, but given that our entire protocol is based around long-lived connections, if this is true in ways other than the load-balancing issues we're discussing here, then perhaps AWS is the wrong choice for us.
(Reporter)

Comment 12

5 years ago
Eureka relies on a rather large layer cake of java code and configuration. It's a bit like saying that there's a really neat sorting routine available for Apple2s that uses A Traps and we should totally use it in b2g. 

This change is effectively a minor, optional element that allows operational control over connectivity. For most small implementers (who will never have more than 200K customers connected at one time), this will almost never be used. 

That said, this was not a decision made after 5 minutes and a six pack. This proposal was created after discussion with our Ops folk, who would be the ones being woken up at 3AM. I'm happy to hear other possible suggestions, of course, but within the confines of the restrictions we have and my general aversion to doing incredibly complicated things rather than proposing simple corrections, I do feel that this is the best solution to the problem at hand.

As for the AWS tangent, no, actually, it's not. There are cost considerations in taking this product to scale that should not be ignored. Right now, AWS is considerably more cost effective for platform delivery, so there's very strong encouragement (from above my pay grade) to try and work within that system.

I'm well aware of the challenges of working with that system. I'm also well aware that it takes more than just buying a machine and finding a hidey-hole in a Starbucks to stick it in if we want this to be reliable. (There are colo costs, redundancy issues, and lots of other factors that go into running a cloud.) I am following up with non-AWS options as much as I can, but it is not relevant to this bug.

I do understand your goal, and I appreciate your playing devil's advocate. I apologize if I seem snarky as I tend to get that way when I'm tired and I tend not to recognize when I'm being that way.
> I do feel that this is the best solution to the problem at hand. 

I understand that this is the best you've come up with, but that's separate from the question of whether or not it actually solves the problem.  I still don't feel like that question is being addressed.

Do you think that having one machine which will handle all incoming connections won't be a scalability bottleneck and a point of failure?  It seems to me that this is just asking for us to be bitten later on, once we actually have users.
Thinking about this a bit more:

WRT round-robin scheduling, I'd think that even with long-lived connections, the law of large numbers will ensure that all servers have very close to the same number of connections, assuming you have tens or hundreds of connections per server.

The bigger problem seems to be that, if you spin up a new server, you want to send it traffic preferentially.

Is that right?

It seems that spinning up (and also, in fact, differences in observed load) could be handled by adding / removing servers from the round robin schedule when they're overloaded.  (Note that if we're doing our own load balancing, we also need some way not to send traffic to overloaded servers, so either way we need a method for detecting that a server is overloaded.)

The Amazon load-balancing API seems to let you dynamically add/remove servers.  Is the issue with this that removing a server closes any existing connections that were established through the load balancer?  If so, that's pretty lame...
(Reporter)

Comment 15

5 years ago
I think you are presuming that the machines behind the central entry point would be doing anything more than routing requests to boxes. For very low levels of overall traffic, this may be the case; however, for large-scale traffic, there's little reason to do that. The machines would simply route connections to other boxes and effectively be far smarter about load. We are already working on that.

As for the round robin, this also presumes that connections are equally lived. That is not the case. In a round-robin system, a server could easily become saturated by long-lived connections while other servers starve. We can drop servers from ELB, but that could also sever those connections, causing a small herd to storm other boxes and possibly triggering a cascade. Being able to direct clients lets us be proactive rather than reactive.
Good to see discussion of possible alternatives going on.

There are a couple of other challenges that round-robin load balancing presents:

1) We may need to shunt users between multiple locations. Imagine a situation in the future where we start to use a second AWS site (something in Asia, maybe). We'd need a way to move users over there, and a redirect is a decent option.

2) There are a number of advantages to keeping a user coming back to the same box consistently. That can't be done with a round-robin approach.

Ignoring everything else, having the ability to redirect users is something that comes at minimal cost to us (though I'm still irked that websockets won't do 302), and provides a nice safety valve in case we do need to use it. If it turns out we don't because we find something better in the future, that would be fantastic.
(Reporter)

Comment 17

5 years ago
(In reply to JR Conlin [:jrconlin,:jconlin] from comment #15)
> I think you are presuming that the machines behind the central entry point
> would be doing anything more than routing requests to boxes.

Wow, typing sucks on an android. My apologies for the spelling mistakes.
> As for the round robin, this also presumes that connections are equally lived.

I think I addressed this earlier.  To be super-concrete: My point does not presume that connections are equally lived.  It presumes only two things:

* the law of large numbers, and
* that round robin is equivalent to assigning connections to a machine uniformly at random, so that the machine you're assigned to is independent (in the statistical sense) of the expected connection length.

If you spread tens or hundreds of thousands of connections at random across a pool of machines and then examine the server state X hours later, it is highly unlikely that we would see major variations in the number of live connections on each server, even if the standard deviation of a connection's length is large.  Cook up a simulation and try it for yourself, if you like.

http://en.wikipedia.org/wiki/Law_of_Large_Numbers
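
For example, a throwaway simulation along those lines (illustrative only; the pool size, connection count, and the exponential lifetime distribution are arbitrary choices, not measurements):

function simulate(servers = 10, connections = 200000, hoursElapsed = 12) {
  const live = new Array(servers).fill(0);
  for (let i = 0; i < connections; i++) {
    const server = Math.floor(Math.random() * servers); // uniform assignment
    // Exponentially distributed lifetime (mean 8h) as a stand-in for a wide
    // spread of connection lengths.
    const lifetimeHours = -8 * Math.log(1 - Math.random());
    if (lifetimeHours > hoursElapsed) {
      live[server]++; // still connected when we look
    }
  }
  const mean = live.reduce((a, b) => a + b, 0) / servers;
  const spread = (Math.max(...live) - Math.min(...live)) / mean;
  console.log({ perServer: live, mean, maxMinGapAsFractionOfMean: spread.toFixed(3) });
}

simulate();

With parameters like these, the max-to-min gap across servers typically comes out to a few percent of the mean.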

I've asked a few times and nobody has argued that the proposal here actually makes the protocol scale without a TCP load balancer, because we still have a single machine as a bottleneck for all incoming connections.  I guess we all agree on this point?

> 2) There are a number of advantages to keeping a user coming back to the same box 
> consistently. That can't be done with a round-robin approach.

The idea is to take this one box, which is a central failure point and throughput bottleneck, and have it maintain a persistent lookup table mapping every active user to a machine?

> 1) We may need to shunt users between multiple locations. Imagine a situation in the 
> future where we start to use a second AWS site (something in Asia, maybe).

As I read it, the AWS load balancer's description specifically says that it can do this.

> 2) There are a number of advantages to keeping a user coming back to the same box 
> consistently. That can't be done with a round-robin approach.

I believe the canonical way of doing this in distributed systems is to add a layer of indirection. You have web-facing boxes which are load-balanced at random, and these boxes talk to a set of boxes across which users' data is sharded. I believe this is the design that the Thialfi paper describes.

The advantage of this over the approach where one box hands out redirects to the user data servers is that it scales and is fault-tolerant.
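
To make the indirection concrete, a tiny sketch of the sort of thing that layer could do (illustrative only; the host names and the modulo hashing scheme are made up, and a real deployment would likely want consistent hashing so that resizing the pool doesn't remap every user):

const crypto = require("crypto");

// Any frontend can serve any client; the client's state lives on a backend
// chosen deterministically from the uaid, so "same user, same box" holds
// without pinning the WebSocket itself.
function backendFor(uaid, backends) {
  const digest = crypto.createHash("sha1").update(uaid).digest();
  return backends[digest.readUInt32BE(0) % backends.length];
}

const backends = ["store-1.internal", "store-2.internal", "store-3.internal"];
console.log(backendFor("abc123", backends)); // every frontend gets the same answer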

> Ignoring everything else, having the ability to redirect users is something that comes at 
> minimal cost to us (though I'm still irked that websockets won't do 302), and provides a 
> nice safety valve in case we do need to use it.

If we all agree that redirects don't solve our scaling problem, this doesn't seem like much of a safety valve.

I really disagree that this is a low-cost thing.  We're adding yet another layer of complexity to a protocol we expect to be widely-adopted and live for a long time.  Protocol design, like all design, is about saying "no" to almost-perfectly-good ideas.

I agree that we have a problem, and I want to solve it too.  But I don't want us to bake redirects into a protocol we expect to live forever if that's not going to solve our problem.

Note that you have options beyond convincing me.  Just get dougt to review this change.  I will not be offended in the least.
(Reporter)

Comment 19

5 years ago
I believe there's a few things that may not have been communicated clearly. 

1. There's no single machine bottleneck.
In essence each "machine" is a cluster that will allow us to swap in new boxes as need be, this includes the central "push.services.mozilla.com" which is the default host for clients to connect. However, as we've learned from systems like Y!Mail and Broadcast.com platforms used for Victoria's Secret webcasts, randomness can also be your worst enemy.

2. While random assignment will eventually distribute evenly, it's not the only factor at play.
As you note in #14, a single machine (even in a given cluster) can become overloaded. The current policy would be to disconnect all users of that machine (max 200K) and allow them to reconnect. That may lead to customers feeling that the service is unstable because it can't keep their connection open reliably. Redirects let us proactively manage loads, as well as direct customers to clusters that are friendlier to them (e.g. a cluster that is geographically closer to their most common access point, or that offers better availability when another cluster has a scheduled outage, or any of a number of other potential considerations).

3. We are trying not to be platform dependent.
While it's certainly possible to base our entire platform off of a single vendor's set of services, not all customers are willing or able to run a similar config. By designing the simplest possible system (similar to HTTP, SMTP and the like), we allow the highest level of adoption.

As I've noted, there is a great deal of experience and lessons learned going into this request. A change like this provides the most transparent and easily implementable fix to a number of very difficult issues, for minimal cost. Also, by "cost", I mean the cost of purchase, maintenance, and operation of machines that will need to be monitored constantly to ensure a reliable customer-facing service. If we can save thousands of dollars of development and operational cost by introducing a few lines of code before general release, that seems to be a clear savings.
I feel like comment 19 changes a lot of the parameters of the argument.  For example, if we're capable of putting multiple machines behind "push.services.mozilla.com", it is not clear why we need redirect capability.

> As I've noted, there is a great deal of experience and lessons learned going into this 
> request.

I feel like the request here is, "please r+ this because we know what we're doing."

That's not how I'm used to doing things, but I'm sure dougt can handle it.
Attachment #783861 - Flags: review?(justin.lebar+bug) → review?(doug.turner)
Just to throw out another idea: If what we really want is HTTP pre-negotiation, I think we can do that without affecting our partners' implementations.

Right now we have wss://push.mozilla.org, and our partners have wss://push.partner.com.  We could spec that https://push.mozilla.org acts as an HTTP redirector, while the websocket protocol remains unchanged.

I guess a downside of doing this would be that if partners want to adopt redirects, they would have to change their push endpoint URL.  I'm not sure this matters, since we don't seem to be designing this for partners.  Another downside would be that if you specify https, you're stuck with that; you can't switch to WSS TCP load-balancing later, if you wanted to, so you eat two extra round trips per connection forever.

I'm not convinced that this is better than simply TCP load-balancing the WSS servers (and using e.g. IP anycast if you want to direct users to geographically-close servers), but you guys did say earlier you would prefer to do it this way.
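
A rough sketch of what that pre-negotiation could look like from the client side (everything here is hypothetical: the /endpoint path, the JSON response shape, and the host name are made up, since nothing like this is specced):

// Hypothetical: ask an HTTPS redirector which WebSocket host to use, then
// speak the unchanged SimplePush protocol to that host.
async function connectPush() {
  const resp = await fetch("https://push.mozilla.org/endpoint"); // made-up path
  const { websocketURL } = await resp.json();                    // made-up response shape
  const ws = new WebSocket(websocketURL);
  ws.onopen = () => {
    ws.send(JSON.stringify({ messageType: "hello", uaid: "abc123" }));
  };
  return ws;
}

This costs the extra HTTP round trips mentioned above on every connection, which is exactly the trade-off being weighed.
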
(Reporter)

Comment 22

5 years ago
There are several issues at play here, and I don't know if they've been communicated fully yet.

One of the larger ones is that Amazon doesn't let you softly pull machines out of rotation. That means that once a machine "fills up", it's still a candidate for getting traffic. The problem magnifies because when a connection fails, the client rightly backs off and tries again (starting at 5 seconds, then doubling the delay to a max of 30 minutes).

In addition, we have a hard limit of 200,000 simultaneous connections per WebSocket head. This is a restriction that both we and Urban Airship have discovered independently (UA confirmed it after we asked them). So, if we're serving one million connections, that means there are five saturated machines and new connections have a 1-in-6 chance of landing on the open box. Once we cross 8 machines, there's a very good chance that clients may take quite some time just to connect to an available socket.
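
To put rough numbers on that, a back-of-the-envelope toy model (not a measurement; it only assumes the 1-in-6 odds and the backoff schedule described above):

function expectedSecondsToConnect(openBoxOdds = 1 / 6) {
  let expected = 0;      // running expected wait, in seconds
  let waitBefore = 0;    // cumulative backoff before the current attempt
  let delay = 5;         // next backoff step, in seconds
  let stillFailing = 1;  // probability that all earlier attempts failed
  for (let attempt = 0; attempt < 50; attempt++) {
    expected += stillFailing * openBoxOdds * waitBefore;
    stillFailing *= 1 - openBoxOdds;
    waitBefore += delay;
    delay = Math.min(delay * 2, 30 * 60); // cap at 30 minutes
  }
  return expected;
}

console.log(expectedSecondsToConnect()); // on the order of 40 minutes at 1:6 odds

Under this toy model the median wait is under two minutes, but the expected wait is on the order of 40 minutes, because the exponential backoff makes the unlucky tail very expensive.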

We could simply double the number of available boxes for connections, of course, but that means we're overpaying for mostly idle machines, and that defeats some of the cost savings we're trying to take advantage of. Likewise, while running boxes ourselves could also help solve some of these problems, that choice comes with a substantial amount of overhead cost and potential issues in the future.

Likewise, one of the requirements we have been working from is that clients don't have a lot of extraneous data exchanges. 

Naturally, we've been focused on these issues a good deal, and it's never a bad thing to have fresh insight from someone.
(Reporter)

Comment 23

5 years ago
Services has put together a list of potential non-redirect recommendations, including pros and cons, for review here: 

https://etherpad.mozilla.org/SimplePushOperationalFocus
Comment on attachment 783861 [details] [diff] [review]
Add "redirect" support to SimplePush WebSocket.

Review of attachment 783861 [details] [diff] [review]:
-----------------------------------------------------------------

Let's make sure we need to add this to the protocol. It sounds like we might not need it. Removing from my review queue until we know.
Attachment #783861 - Flags: review?(doug.turner)
Unassigning myself. If this isn't needed, it can also be closed.
Assignee: nsm.nikhil → nobody
Flags: needinfo?(jrconlin)
(Reporter)

Comment 26

3 years ago
Overcome by events. Closing as invalid.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Flags: needinfo?(jrconlin)
Resolution: --- → INVALID
We'll want this for transitioning to the HTTP/2 server.
Assignee: nobody → kcambridge
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Closing this out for now, as the migration timeline is unclear. Also, older clients won't necessarily be able to speak the latest version of the Web Push protocol when we transition. In that case, redirecting them to an H/2 server will do more harm than good.
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago → 2 years ago
Resolution: --- → WONTFIX
Component: General → DOM: Push Notifications
Product: Firefox OS → Core
Target Milestone: --- → Future