Closed Bug 1055139 Opened 10 years ago Closed 10 years ago

Loop client needs to provide hook for load balancing Simple Push servers

Categories

(Hello (Loop) :: Client, defect)

defect
Not set
normal

Tracking

(firefox34+ fixed, firefox35+ fixed)

RESOLVED FIXED
mozilla35
Iteration:
35.1
Tracking Status
firefox34 + fixed
firefox35 + fixed
backlog Fx35?

People

(Reporter: abr, Assigned: abr)

References

Details

(Whiteboard: [loop-uplift][fig:wontverify])

Attachments

(1 file, 3 obsolete files)

Currently, the Loop client uses the built-in browser pref "services.push.serverURL" to determine which push server to use. If Loop were to go into release with this configuration, it is very likely that we would overload the Simple Push server rather quickly.

After discussions with the Simple Push team, one tool that we've settled on to address this situation is to retrieve the address of the Simple Push server from the Loop server at client startup. This allows for two important changes: 1) We can deploy dedicated Simple Push clusters for Loop, probably with certain unneeded and resource-intensive features disabled, and 2) We can set up the Loop server to distribute load among multiple Simple Push clusters by returning different Simple Push URLs to clients when they ask for a URL.

To provide a transition path, the current patch falls back to using services.push.serverURL if the attempt to retrieve a URL from the Loop server fails. I will be filing a followup bug to remove this behavior (i.e., to make the registration attempt fail if the Loop server cannot provide a Simple Push server URL): we don't want to knock over the Simple Push server if something starts going wrong handing out URLs.
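The startup flow described above can be sketched in plain JavaScript. This is a minimal sketch, not the actual patch: the real implementation lives in MozLoopPushHandler.jsm and uses XMLHttpRequest; the function and parameter names here (`choosePushServerURL`, `fetchConfig`, `fallbackURL`) are illustrative. `fetchConfig` stands in for the GET to the Loop server's config endpoint, and `fallbackURL` stands in for the services.push.serverURL pref.

```javascript
// Sketch of the selection logic: ask the Loop server for a Simple Push
// server URL, and fall back to the configured pref on any failure.
// All names here are hypothetical; the real code is in MozLoopPushHandler.jsm.
async function choosePushServerURL(fetchConfig, fallbackURL) {
  try {
    const config = await fetchConfig();
    if (config && typeof config.pushServerURI === "string") {
      // The Loop server picked a cluster for this client.
      return config.pushServerURI;
    }
    console.warn("push server URL config lacks pushServerURI; falling back");
  } catch (e) {
    console.warn("push server URL retrieval failed; falling back: " + e.message);
  }
  // Transition-path behavior; bug 1055145 tracks removing this fallback.
  return fallbackURL;
}
```

A client would call this once at startup and hand the result to its push connection logic; because the fallback swallows errors, the registration currently succeeds even when the Loop server is unreachable.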
See Also: → 1055143
Push server fall-back removal bug filed as Bug 1055145
Anthony: just pinging you with a needinfo so you're aware of this feature. We want to land this in 34 so that we can ensure scaling once we get into release -- the simple push server folks have highlighted that we may well run into scaling issues without something like this.
Flags: needinfo?(anthony.s.hughes)
Thanks for the heads up, Adam. James, since this deals with scalability could you take a look at it?
Flags: needinfo?(anthony.s.hughes) → needinfo?(jbonacci)
QA Contact: jbonacci
:ashughes I will be tracking bug 1055143 directly. I will still need help from the client teams to verify changes made to current client-side functionality for this bug and bug 1055145. You can leave me assigned to the bug for now, though... Thanks.
Flags: needinfo?(jbonacci)
Thanks James. I will of course help you with client-side testing when the time comes. Feel free to need-info me.
Mark: I've manually verified this by running it against the dev server, and it does what it's supposed to. I've also induced a number of failures -- some accidentally :-) -- and been through the error paths. Unfortunately, isolating this behavior to sanity-check it in an automated test would be a pretty big rigamarole, so I'm not landing any tests with it.
Attachment #8480211 - Flags: review?(standard8)
Attachment #8474661 - Attachment is obsolete: true
Comment on attachment 8480211 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server

Tagging Tim and Mark in case they can get to this first -- I'll take the first r+ that comes in. :)

This is important for not melting the push infrastructure when Loop/Hello goes live in 34 release. We need to land it before uplift or uplift it into Aurora. It's a lot easier if we don't need to uplift it later. :)
Attachment #8480211 - Flags: review?(ttaubert)
Attachment #8480211 - Flags: review?(mhammond)
Re tests, I think it might actually be fairly simple with the xpcshell test infrastructure we already have in place, but I haven't got time until after the merges to look at it, and I don't think I'd block on it either.
[Tracking Requested - why for this release]: See Comment 8. We need this to make sure that the push infrastructure doesn't suffer when loop goes live. This may land before uplift, but if it doesn't, we need this in Fx34.
Comment on attachment 8480211 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server

Review of attachment 8480211 [details] [diff] [review]:
-----------------------------------------------------------------

Looks reasonable to me.

::: browser/components/loop/MozLoopPushHandler.jsm
@@ +229,5 @@
> +      let req = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"].
> +                createInstance(Ci.nsIXMLHttpRequest);
> +      req.open("GET", pushUrlEndpoint);
> +      req.onload = () => {
> +        if (req.status < 300) {

Seems almost impossible we'd see a 1XX response, but still, we might as well check for >= 200.

@@ +239,5 @@
> +          console.warn("MozLoopPushHandler - push server URL config lacks pushServerURI parameter");
> +          pushServerURLFetchError();
> +        }
> +      } else {
> +        console.warn("MozLoopPushHandler - push server URL retrieve error: " + req.status);

To be robust, it might make sense to have an exception handler for the JSON parse?
Attachment #8480211 - Flags: review?(mhammond) → review+
Attachment #8480211 - Flags: review?(ttaubert)
Attachment #8480211 - Flags: review?(standard8)
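The two review suggestions above (an explicit 2xx range check and a guarded JSON parse) could be applied along these lines. This is a sketch, not the landed code: the names (`handleConfigResponse`, `onURL`, `onError`) are illustrative stand-ins for the actual identifiers in MozLoopPushHandler.jsm.

```javascript
// Sketch of hardened response handling for the push-server-config fetch:
// reject non-2xx statuses explicitly and guard JSON.parse against
// malformed bodies. Names are hypothetical.
function handleConfigResponse(status, responseText, onURL, onError) {
  if (status >= 200 && status < 300) { // per review: check >= 200, not just < 300
    let config;
    try {
      config = JSON.parse(responseText); // guard against non-JSON bodies
    } catch (e) {
      console.warn("push server URL config is not valid JSON");
      return onError();
    }
    if (config && config.pushServerURI) {
      return onURL(config.pushServerURI);
    }
    console.warn("push server URL config lacks pushServerURI parameter");
    return onError();
  }
  console.warn("push server URL retrieve error: " + status);
  return onError();
}
```

With this shape, a 1XX response, an HTTP error, a garbage body, and a well-formed body missing pushServerURI all route through the same error path.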
Oops -- closing the network is verboten during tests. Fixing... https://hg.mozilla.org/integration/fx-team/rev/93b6e71b019a
Backed both out in https://hg.mozilla.org/integration/fx-team/rev/95b54a829a57 for still touching the network, now in browser_mozLoop_doNotDisturb.js.
Working theory is that the extra async involved in getting the push server URL gives the mochitest time to put the real prefs back, so that we actually try to use a real push server instead of localhost. This is a try run of a patch that attempts to bypass the whole mess by marking the system offline during the Loop tests: https://tbpl.mozilla.org/?tree=Try&rev=786df7cfe306
Attachment #8480211 - Attachment is obsolete: true
Comment on attachment 8482759 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server

Carrying forward r+ from mhammond
Attachment #8482759 - Flags: review+
Comment on attachment 8482759 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server

Mark: Can I get an r+ from you on the head.js change here? This is the change we discussed in our conversation on IRC.
Attachment #8482759 - Flags: review?(standard8)
Comment on attachment 8482759 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server

Review of attachment 8482759 [details] [diff] [review]:
-----------------------------------------------------------------

r+ for the mochitest parts.

::: browser/components/loop/test/mochitest/head.js
@@ +56,5 @@
>   // Set prefs to ensure we don't access the network externally.
>   Services.prefs.setCharPref("services.push.serverURL", "ws://localhost/");
>   Services.prefs.setCharPref("loop.server", "http://localhost/");
> +  let wasOffline = Services.io.offline;
> +  Services.io.offline = true;

I think maybe add a comment along the lines of:

// Turn off the network for loop tests, so that we don't
// try to access the remote servers. If we want to turn this
// back on in future, be careful to check for intermittent
// failures.
Attachment #8482759 - Flags: review?(standard8) → review+
Looks good on try, so let's give this another spin: https://hg.mozilla.org/integration/fx-team/rev/e6cf07180934
Okay, that broke xpcshell again. Backing out *again*: https://hg.mozilla.org/integration/fx-team/rev/2dd9b2f2ef8f
Curse the fact that I'm working on two bugs with extremely similar bug numbers. I think I got my push tests backwards - I pushed '139 with browser mochi tests, and '319 with xpcshell tests, which is backwards from what I wanted. Let's level-set. This is a try push of *this* bug with xpcshell tests. Given that it failed on fx-team, I expect it to fail on try. If not, we have bigger problems to deal with: https://tbpl.mozilla.org/?tree=Try&rev=e65618604015
And now, to be hyper-defensive after bouncing off the tree so many times, here's a full suite of tests run on try: https://tbpl.mozilla.org/?tree=Try&rev=79c6b5703bb8
Okay. That try was as green as tries ever get -- two known intermittent oranges and a really odd Android debug crash that is not plausibly related to my patch. I'm going to call that a clean bill of health; so, once more, with feeling: https://hg.mozilla.org/integration/fx-team/rev/c8dc98469fc0
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Comment on attachment 8484463 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server (as landed)

Carrying forward r+ from mhammond, Standard8

Approval Request Comment
[Feature/regressing bug #]: Loop
[User impact if declined]: Overall capacity of the Loop service would be constrained to whatever could be supported by one Simple Push cluster (~5 million users). After we reach that level, we would need to take steps to disable the Loop service for remaining users, resulting in partial service unavailability.
[Describe test coverage new/current, TBPL]: The changes made by this patch have been verified by running it against the http://loop-dev.stage.mozaws.net/ development server, which already has the server-side portion of this functionality implemented. Based on logging messages added to the patch (temporarily, during testing), the function works as planned.
[Risks and why]: The risks to taking this patch are very low for the system in general, as the changes are isolated to Loop's push server handling. The risk to the Loop server handling is also low; the maximum impact is that a failure induced by this change may prevent incoming Loop calls from arriving. This would be detected and reported to the user as an error.
[String/UUID change made/needed]: None
Attachment #8484463 - Flags: review+
Attachment #8484463 - Flags: approval-mozilla-aurora?
Comment on attachment 8484463 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server (as landed)

From speaking with abr: This is a temporary solution until a new server is in deployment. Once the new server is deployed, we're going to want to pull out the logic that falls back to the default server.

Aurora+
Attachment #8484463 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
James, can you please have a look at this to determine what testing this needs?
Flags: qe-verify?
Flags: needinfo?(jbonacci)
:ashughes - I will see what I can do. I don't really have full context anymore. I will talk to the Loop-Server and SimplePush devs for more info about getting the SP (Prod) server info dynamically from Loop-Server Prod...

+ tarek
+ ben

Starting with "needinfo" but will probably just have an offline discussion...
Flags: needinfo?(tarek)
Flags: needinfo?(jbonacci)
Flags: needinfo?(bbangert)
:ashughes So, just to clarify some flags. How is this "fixed" for Fx35? What do we want to see for Fx34? services.push.serverURL is defined for Fx33 as well. What are our plans for this release?
Flags: needinfo?(anthony.s.hughes)
status-firefox35:fixed just means that the patch landed in mozilla-central during the Firefox 35 timeframe (comment 26). Aurora 34 uplift was requested (comment 28) and approved (comment 29), but the landing on mozilla-aurora has not yet happened. Based on that, my assumption is we want 35 and 34 to be the same here. 33 is a different question though; maybe Adam Roach can answer that?
Flags: needinfo?(anthony.s.hughes) → needinfo?(adam)
James: I think we have everything covered in our functional tests. I guess one interesting thing to do would be to add an assertion in the load tests to verify that we're really doing some load balancing and how effective the shuffling is (as long as we *do* have several Simple Push nodes in place).
Flags: needinfo?(tarek)
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #33)
> status-firefox35:fixed just means that the patch landed in mozilla-central
> during the Firefox 35 timeframe (comment 26). Request for Aurora 34 uplift
> was requested (comment 28) and approved (comment 29) but the landing on
> mozilla-aurora has not yet happened.
>
> Based on that my assumption is we want 35 and 34 to be the same here. 33 is
> a different question though, maybe Adam Roach can answer that?

Loop in Fx33 is only going to be active for beta -- it won't ride out into release, so we don't need this mechanism.

Loop in 34 is going to be in release, but only in the customization palette. We're going to want to manage load just in case we see a really big uptake of the feature despite its relative obscurity.

Loop in 35 (and later) will definitely need this mechanism, as we're going to need the ability to scale across multiple clusters. We will probably be landing a follow-up bug in the 35 timeframe that removes the "fallback to configured value", but we'll wait until the Loop server that supports this functionality is deployed first.
Flags: needinfo?(adam)
OK. Thanks for all the updates. So a lot of the initial work/test/verification needs to happen on the server side. We cannot do that until there is an SP Stage and an SP Prod environment. Then we can look at the SP <------> Loop-Server part of the pipeline. After that, services QA can work with client QA to make sure that Loop-Server returns what it is supposed to return for the client requests. Plus the various error handling at both ends and on/from all the services. Fun.
Flags: needinfo?(bbangert)
Flagging this for QE verification based on comment 36. James, please let me know when we're ready for client-side testing.
Flags: qe-verify? → qe-verify+
Whiteboard: [loop-uplift]
Hi James, What's the QE status here? I need to make sure this is verified fixed against Nightly and Fig in the next 24-48 hours.
Flags: needinfo?(jbonacci)
SP test and stage environments are in the process of being set up (https://bugzilla.mozilla.org/show_bug.cgi?id=1072495). In a discussion today with Ops, there are several cases the client will need to handle:

1) The client will need to change the SP cluster it's connected to if that cluster becomes 'full'. To do this it will need to re-query the cluster URL to locate a new cluster to connect to, and update the Loop server with its new endpoint URL.

2) For SP cluster updates, Ops may either do a rolling restart, or more likely set up a new SP cluster and change DNS to point to the replacement cluster. In this case, the client will need to drop its connection, query DNS for new IPs, and reconnect to the replacement cluster.

I'm unaware of exactly what the client currently does with the SP name it chooses from the supplied URL, or how we can instruct it that it must re-choose (as in case 1 above).
Richard, since James is on PTO can you have a look at this?
Flags: needinfo?(jbonacci) → needinfo?(rpappalardo)
Whiteboard: [loop-uplift] → [loop-uplift][fig:verifyme]
QA Contact: jbonacci → rpappalardo
I'm not yet clear on what's still needed here, but let me touch base with :kthiessen tomorrow & get back to you
As far as I understand it, verification of this bug depends on bug 1072495 being complete.
< ashughes> rpapa: in reading it further I think this may depend on bug 1072495 -- it may just need to wait until James is back
I'm marking this wontverify for Fig since we're still blocked here.
Whiteboard: [loop-uplift][fig:verifyme] → [loop-uplift][fig:wontverify]
Flags: needinfo?(jbonacci)
Please see:
https://bugzilla.mozilla.org/show_bug.cgi?id=1055139#c40
https://bugzilla.mozilla.org/show_bug.cgi?id=1055139#c43
https://bugzilla.mozilla.org/show_bug.cgi?id=1055139#c44

So, no. This cannot be verified until the SP Stage and Production environments are up and running. Clear whatever fields need to be cleared. We will get to this (both client and server teams) once the environments are done.
Flags: needinfo?(rpappalardo)
Flags: needinfo?(jbonacci)
Marking this as qe-verify- based on comment 47. Please reflag this + if/when this becomes verifiable.
Flags: qe-verify+ → qe-verify-
:ashughes I added some dependencies that will make this QA work easier to schedule...
Depends on: 1082188
:jbonacci - stage is up - can we test with that to verify this as fixed?
Yes, we can test this. I need some coordination with the client teams. I guess we need to make sure that the client (in some version of Fx) can now talk to the stack (Prod). Then we need Ops to verify that the communication is really happening.

We also need to verify the configs are changed for loop-client. Not sure if this is built into some Fx release already, or if we need to make an actual change to loop-client (deployed) or standalone.

Not sure we can generate enough traffic via the client to verify actual SP load balancing. (The Loop-Server load test does not hit SP, nor is it supposed to.) If the SP load test is high-traffic enough, that will be our "load balancing" verification step. Separately, any automated tests with the config set correctly should generate "enough" traffic to allow Ops to verify that the Prod ELB is being hit and individual SP instances are being used to handle that traffic.

Anything else I am missing? I highly recommend a meeting to get SP devs, Loop-Client devs, QA, and Ops around a table to get the quickest/shortest path to verification. Neither Richard nor I should be in the QA field for this bug.
QA Contact: rpappalardo → anthony.s.hughes
(In reply to James Bonacci [:jbonacci] from comment #51)
> I highly recommend a meeting to get SP Devs, Loop-Client Devs, QA, and OPs
> around a table to get the quickest/shortest path to verification.

Let's continue this discussion via email. I'll set up a meeting once I get a complete list of people who need to be there.
Flags: qe-verify- → qe-verify+
No longer depends on: 1082188
Cool. Thanks, :rpapa. Adding the Prod deploy as a dependency for this bug. That is what we really need.
Depends on: 1090504
Adam -- Can you read over Comment 51 (https://bugzilla.mozilla.org/show_bug.cgi?id=1055139#c51) and confirm that the loop client is ready for testing? I believe it is, but you're closer to this than I am. Thanks.
Flags: needinfo?(adam)
(In reply to James Bonacci [:jbonacci] from comment #51)
> Yes, we can test this. I need some coordination with the client teams.
> I guess we need to make sure that the client (in some version of Fx) can now
> talk to the stack (Prod).
> Then we need OPs to verify that the communication is really happening.
>
> We also need to verify the configs are changed for loop-client. Not sure if
> this is built into some Fx release already, or if we need to make an actual
> change to loop-client (deployed) or standalone.

Ready for testing in Nightly 36:
http://dxr.mozilla.org/mozilla-central/source/browser/components/loop/MozLoopPushHandler.jsm#290

Ready for testing in Aurora 35:
http://hg.mozilla.org/releases/mozilla-aurora/file/afef8d8afd37/browser/components/loop/MozLoopPushHandler.jsm#l227

Ready for testing in Beta 34:
http://hg.mozilla.org/releases/mozilla-beta/file/c94fc6b83daa/browser/components/loop/MozLoopPushHandler.jsm#l227

The change is not in Release 33. The standalone client does not use the SP server, so it does not need verification.

> Not sure we can generate enough traffic via client to verify actual SP load
> balancing.
> (the Loop-Server load test does not hit SP, nor is it supposed to)
> If the SP load test is high traffic enough, that will be our "load
> balancing" verification step.
> Separately, any automated tests with the config set correctly, should
> generate "enough" traffic to allow OPs to verify that the Prod ELB is being
> hit and individual SP instances are being used to handle that traffic.

So, I would test this using a couple of different techniques. First, an end-to-end test where you bring up several instances of Firefox and verify that they're not all connecting to the same Simple Push server. That will verify that the client and server are interacting with each other properly.
Then, I'd write a script that makes many (hundreds or thousands) of requests to https://loop.services.mozilla.com/push-server-config and records how many times each server URL is returned. Each server should appear approximately as many times as every other server. You can probably spot problems by hand; if you want to be rigorous or automate the tests, I would check that the standard deviation was below some reasonable threshold (e.g., stddev < 100 for 10,000 samples). Note that both of these tests assume that the Loop server has been configured to point at several simple push clusters rather than the single "wss://push.services.mozilla.com/" they appear to be configured with right now.
Flags: needinfo?(adam)
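The sampling check Adam describes above (tally the URL returned by /push-server-config over many requests, then compare the per-server counts and their standard deviation) could be sketched like this. The function and parameter names are illustrative; `getConfig` stands in for whatever performs the GET and extracts the push server URL from the response.

```javascript
// Sketch of the distribution check: sample the config endpoint `samples`
// times, count how often each push server URL is handed out, and compute
// the standard deviation of the counts. With fair N-way balancing, each
// count should be near samples / N and the stddev should be small
// (e.g., the suggested stddev < 100 for 10,000 samples).
async function tallyPushServers(getConfig, samples) {
  const counts = {};
  for (let i = 0; i < samples; i++) {
    const url = await getConfig();
    counts[url] = (counts[url] || 0) + 1;
  }
  const values = Object.values(counts);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const stddev = Math.sqrt(
    values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length);
  return { counts, stddev };
}
```

As noted later in this bug, this is a statistical sample rather than a load test: the requests can be spread over hours without affecting the result.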
James, can you please advise how we can reliably test this given what Adam has said in comment 55?
Flags: needinfo?(jbonacci)
:ashughes we will need to coordinate with the SP Dev team and the OPs team. That is why I wanted them in the meeting. Also, I did not even know this endpoint existed: https://loop.services.mozilla.com/push-server-config.
Flags: needinfo?(jbonacci)
> Then, I'd write a script that makes many (hundreds or thousands) of requests to
> https://loop.services.mozilla.com/push-server-config and records how many times each server URL is
> returned. Each server should appear approximately as many times as every other server. You can probably
> spot problems by hand; if you want to be rigorous or automate the tests, I would check that the standard
> deviation was below some reasonable threshold (e.g., stddev < 100 for 10,000 samples).

This sounds like a load test, though doing something different than what push-tester does. Do we want to add an additional test for this?
:rpappa - I'd probably just curl the endpoint in a loop and record the value rather than make it more formal. I think the steps to verify this bug as fixed are:
1. Ops or dev wire up the load-balancing URL so it is configured to return 50%/50% (or some percentage split) between two URLs (possibly CNAMEs to the same server).
2. CldSvcs curls that endpoint and gets the percentage distribution over, say, 100 curls.
3. Client QA opens up Nightly with the Console and confirms each endpoint is represented at the expected percentage.
:rpapa - the loop-server load test does not hit the SP server or that endpoint, so normally an extra test would be needed, as :abr pointed out :-)
:edwong has it right - simpler is better
The last bit of my first sentence fell away during copy/paste apparently. :rpapa you are right and I am right - neither load test currently hits this endpoint...
Depends on: 1091100
To be clear, I'm not proposing this as a load test of the loop server -- the repeated check is to get a statistically significant number of samples of the URL it hands out for the loop server. It doesn't matter if you do them "best speed" or over the course of a few hours.
Nominating this to block Loop 35 based on today's SimplePush QA coordination meeting. Maire confirmed this is a nice-to-have for 34 given the softlaunch but a hard requirement for 35. Aiming for 35 will give Ops & QA the time we need to not cut corners in our testing and make sure we do this right.
backlog: --- → Fx35?
Depends on: 1091192
I'm not clear on what the configuration for the server should look like. What are the Simple Push endpoints we should use here?
Flags: needinfo?(kcambridge)
:alexis, the Loop server, or the client? The Loop server will store the endpoint the client hands it as usual; clients will hand the Loop server endpoints under different domains, so the entire endpoint URL will need to be held by the Loop server.
Yes, the `/push-server-config` exposed by the Loop server is fine. The client just needs to handle the "redirect" and "cluster full" replies sent by the Push server after the opening handshake. In case the cluster is full, the client will need to make another request to `/push-server-config` and obtain a push URL for the new cluster.
Flags: needinfo?(kcambridge)
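The control flow Kit describes above (on a "redirect" or "cluster full" reply after the handshake, discard the cached config, re-query `/push-server-config`, and reconnect to the new cluster) might look roughly like this. Note this is purely illustrative: this bug does not spell out the wire format of the Push server's post-handshake replies, so the message shape used here ({status: ...}) and all function names are assumptions, not the actual protocol or client code.

```javascript
// Hypothetical sketch of the re-query flow. `refetchConfig` stands in for a
// fresh (uncached) request to /push-server-config; `connect` stands in for
// opening a websocket to the given push server URL. The {status} message
// shape is an assumption for illustration only.
function makeHandshakeHandler(refetchConfig, connect) {
  return async function onHandshakeReply(reply) {
    if (reply.status === "ok") {
      return "connected";
    }
    if (reply.status === "cluster-full" || reply.status === "redirect") {
      const config = await refetchConfig(); // get a new cluster hostname
      await connect(config.pushServerURI);  // reconnect elsewhere
      return "reconnected:" + config.pushServerURI;
    }
    return "error";
  };
}
```

The key design point is that the cached `/push-server-config` response must be invalidated before reconnecting; otherwise the client would bounce back to the same full cluster.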
Iteration: --- → 35.1
(In reply to Kit Cambridge (:spindrift, :kcambridge) from comment #66)
> Yes, the `/push-server-config` exposed by the Loop server is fine. The
> client just needs to handle the "redirect" and "cluster full" replies sent
> by the Push server after the opening handshake. In case the cluster is full,
> the client will need to make another request to `/push-server-config` and
> obtain a push URL for the new cluster.

I'm a little confused here. The whole point of having the push server endpoint distributed by the Loop server was that the Loop server could hand out different values to different clients. It sounds like you're trying to do something different than that now?
(In reply to Adam Roach [:abr] from comment #67)
> I'm a little confused here. The whole point of having the push server
> endpoint distributed by the loop server was that the loop server could hand
> out different values to different clients.

Yep -- but the `/push-server-config` endpoint hands out the hostname of the Push cluster. As I understand it, when the client connects to that URL, it resolves the hostname to a random node in the cluster -- and caches the DNS lookup. If that Push node is overloaded, we need to redirect the client to a different one. We're worried that, if we just close the connection, the client will reach into its DNS cache and reconnect to the same overloaded node.

It also looks like the client is caching the response from `/push-server-config`:
https://github.com/mozilla/gecko-dev/blob/master/browser/components/loop/MozLoopPushHandler.jsm#L289-291

If the entire cluster becomes overloaded, the client will need to requery that endpoint for a new Push cluster hostname.
I thought we would return a different node domain name, not the same one all the time; otherwise there is almost no point in doing this. So, are you saying the round-robin is handled directly by DNS lookup? Also, how do we know which values should be configured on the Loop server side?
Flags: needinfo?(kcambridge)
There will be a variety of names returned, and they will be updated as clusters fill up. Ops will be supplying the names of the clusters to use.
Flags: needinfo?(kcambridge)
Ben, do you know what mechanism will be used to obtain the list of names? Also, when should we start to use them, and who in Ops is knowledgeable about that?
Flags: needinfo?(bbangert)
The current name is at the bottom of this ticket:
https://bugzilla.mozilla.org/show_bug.cgi?id=1090504

I believe Ops will be adding clusters; when they do, they will likely file a bug to ensure the names at this URL can be updated.
Flags: needinfo?(bbangert)
(In reply to Adam Roach [:abr] from comment #55)
> Then, I'd write a script that makes many (hundreds or thousands) of requests
> to https://loop.services.mozilla.com/push-server-config and records how many
> times each server URL is returned. Each server should appear approximately
> as many times as every other server. You can probably spot problems by hand;
> if you want to be rigorous or automate the tests, I would check that the
> standard deviation was below some reasonable threshold (e.g., stddev < 100
> for 10,000 samples).

Load balancing working on stage:

TRIES: 10000
RESPONSE TALLY: {'loop-push2.stage.mozaws.net': 4939, 'loop-push1.stage.mozaws.net': 5061}
RUNTIME: 7111.8 seconds
STANDARD DEVIATION: 86.3

Will re-verify when this gets deployed to prod with loop-push 1.5.0; see:
https://bugzilla.mozilla.org/show_bug.cgi?id=1104994#c11

Note on https://bugzilla.mozilla.org/show_bug.cgi?id=1104994#c9: We decided to go with only one push endpoint -- push2 currently exists in stage but not in prod.