Closed Bug 1055139 Opened 10 years ago Closed 10 years ago

Loop client needs to provide hook for load balancing Simple Push servers

Categories

(Hello (Loop) :: Client, defect)

defect
Not set
normal

Tracking

(firefox34+ fixed, firefox35+ fixed)

RESOLVED FIXED
mozilla35
Iteration:
35.1
Tracking Status
firefox34 + fixed
firefox35 + fixed
backlog Fx35?

People

(Reporter: abr, Assigned: abr)

References

Details

(Whiteboard: [loop-uplift][fig:wontverify])

Attachments

(1 file, 3 obsolete files)

Currently, the Loop client uses the built-in browser pref "services.push.serverURL" to determine which push server to use. If Loop were to go into release with this configuration, it is very likely that we would overload the Simple Push server rather quickly.

After discussions with the Simple Push team, one tool that we've settled on to address this situation is to retrieve the address of the Simple Push server from the Loop server at client startup. This allows for two important changes: 1) We can deploy dedicated Simple Push clusters for Loop, probably with certain unneeded and resource-intensive features disabled, and 2) We can set up the Loop server to distribute load among multiple Simple Push clusters by returning different Simple Push URLs to clients when they ask for a URL.

To provide a transition path, the current patch falls back to using services.push.serverURL if the attempt to retrieve a URL from the Loop server fails. I will be filing a followup bug to remove this behavior (i.e., to make the registration attempt fail if the Loop server cannot provide a Simple Push server URL): we don't want to knock over the Simple Push server if something starts going wrong handing out URLs.
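The startup flow described above can be sketched in plain JavaScript. This is a minimal sketch, not the actual patch: the real implementation lives in MozLoopPushHandler.jsm and uses XMLHttpRequest; the function and parameter names here (`choosePushServerURL`, `fetchConfig`, `fallbackURL`) are illustrative. `fetchConfig` stands in for the GET to the Loop server's config endpoint, and `fallbackURL` stands in for the services.push.serverURL pref.

```javascript
// Sketch of the selection logic: ask the Loop server for a Simple Push
// server URL, and fall back to the configured pref on any failure.
// All names here are hypothetical; the real code is in MozLoopPushHandler.jsm.
async function choosePushServerURL(fetchConfig, fallbackURL) {
  try {
    const config = await fetchConfig();
    if (config && typeof config.pushServerURI === "string") {
      // The Loop server picked a cluster for this client.
      return config.pushServerURI;
    }
    console.warn("push server URL config lacks pushServerURI; falling back");
  } catch (e) {
    console.warn("push server URL retrieval failed; falling back: " + e.message);
  }
  // Transition-path behavior; bug 1055145 tracks removing this fallback.
  return fallbackURL;
}
```

A client would call this once at startup and hand the result to its push connection logic; because the fallback swallows errors, the registration currently succeeds even when the Loop server is unreachable.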
See Also: → 1055143
Push server fall-back removal bug filed as Bug 1055145
Anthony: just pinging you with a needinfo so you're aware of this feature. We want to land this in 34 so that we can ensure scaling once we get into release -- the simple push server folks have highlighted that we may well run into scaling issues without something like this.
Flags: needinfo?(anthony.s.hughes)
Thanks for the heads up, Adam. James, since this deals with scalability could you take a look at it?
Flags: needinfo?(anthony.s.hughes) → needinfo?(jbonacci)
QA Contact: jbonacci
:ashughes I will be tracking bug 1055143 directly. I will still need help from the client teams to verify changes made to current client-side functionality for this bug and bug 1055145. You can leave me assigned to the bug for now, though... Thanks.
Flags: needinfo?(jbonacci)
Thanks James. I will of course help you with client-side testing when the time comes. Feel free to need-info me.
Mark: I've manually verified this by running it against the dev server, and it does what it's supposed to. I've also induced a number of failures -- some accidentally :-) -- and been through the error paths. Unfortunately, isolating this behavior to sanity-check it in an automated test would be a pretty big rigamarole, so I'm not landing any tests with it.
Attachment #8480211 - Flags: review?(standard8)
Attachment #8474661 - Attachment is obsolete: true
Comment on attachment 8480211 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server

Tagging Tim and Mark in case they can get to this first -- I'll take the first r+ that comes in. :)

This is important for not melting the push infrastructure when Loop/Hello goes live in 34 release. We need to land it before uplift or uplift it into Aurora. It's a lot easier if we don't need to uplift it later. :)
Attachment #8480211 - Flags: review?(ttaubert)
Attachment #8480211 - Flags: review?(mhammond)
Re tests, I think it might actually be fairly simple with the xpcshell test infrastructure we already have in place, but I haven't got time until after the merges to look at it, and I don't think I'd block on it either.
[Tracking Requested - why for this release]: See Comment 8. We need this to make sure that the push infrastructure doesn't suffer when loop goes live. This may land before uplift, but if it doesn't, we need this in Fx34.
Comment on attachment 8480211 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server

Review of attachment 8480211 [details] [diff] [review]:
-----------------------------------------------------------------

Looks reasonable to me.

::: browser/components/loop/MozLoopPushHandler.jsm
@@ +229,5 @@
> +      let req = Cc["@mozilla.org/xmlextras/xmlhttprequest;1"].
> +                createInstance(Ci.nsIXMLHttpRequest);
> +      req.open("GET", pushUrlEndpoint);
> +      req.onload = () => {
> +        if (req.status < 300) {

Seems almost impossible we'd see a 1XX response, but still, we might as well check for >= 200.

@@ +239,5 @@
> +          console.warn("MozLoopPushHandler - push server URL config lacks pushServerURI parameter");
> +          pushServerURLFetchError();
> +        }
> +      } else {
> +        console.warn("MozLoopPushHandler - push server URL retrieve error: " + req.status);

To be robust, it might make sense to have an exception handler for the JSON parse?
Attachment #8480211 - Flags: review?(mhammond) → review+
Attachment #8480211 - Flags: review?(ttaubert)
Attachment #8480211 - Flags: review?(standard8)
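The two review suggestions above (an explicit 2xx range check and a guarded JSON parse) could be applied along these lines. This is a sketch, not the landed code: the names (`handleConfigResponse`, `onURL`, `onError`) are illustrative stand-ins for the actual identifiers in MozLoopPushHandler.jsm.

```javascript
// Sketch of hardened response handling for the push-server-config fetch:
// reject non-2xx statuses explicitly and guard JSON.parse against
// malformed bodies. Names are hypothetical.
function handleConfigResponse(status, responseText, onURL, onError) {
  if (status >= 200 && status < 300) { // per review: check >= 200, not just < 300
    let config;
    try {
      config = JSON.parse(responseText); // guard against non-JSON bodies
    } catch (e) {
      console.warn("push server URL config is not valid JSON");
      return onError();
    }
    if (config && config.pushServerURI) {
      return onURL(config.pushServerURI);
    }
    console.warn("push server URL config lacks pushServerURI parameter");
    return onError();
  }
  console.warn("push server URL retrieve error: " + status);
  return onError();
}
```

With this shape, a 1XX response, an HTTP error, a garbage body, and a well-formed body missing pushServerURI all route through the same error path.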
Oops -- closing the network is verboten during tests. Fixing... https://hg.mozilla.org/integration/fx-team/rev/93b6e71b019a
Backed both out in https://hg.mozilla.org/integration/fx-team/rev/95b54a829a57 for still touching the network, now in browser_mozLoop_doNotDisturb.js.
Working theory is that the extra async involved in getting the push server URL gives the mochitest time to put the real prefs back, so that we actually try to use a real push server instead of localhost. This is a try run of a patch that attempts to bypass the whole mess by marking the system offline during the Loop tests: https://tbpl.mozilla.org/?tree=Try&rev=786df7cfe306
Attachment #8480211 - Attachment is obsolete: true
Comment on attachment 8482759 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server

Carrying forward r+ from mhammond
Attachment #8482759 - Flags: review+
Comment on attachment 8482759 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server

Mark: Can I get an r+ from you on the head.js change here? This is the change we discussed in our conversation on IRC.
Attachment #8482759 - Flags: review?(standard8)
Comment on attachment 8482759 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server

Review of attachment 8482759 [details] [diff] [review]:
-----------------------------------------------------------------

r+ for the mochitest parts.

::: browser/components/loop/test/mochitest/head.js
@@ +56,5 @@
>   // Set prefs to ensure we don't access the network externally.
>   Services.prefs.setCharPref("services.push.serverURL", "ws://localhost/");
>   Services.prefs.setCharPref("loop.server", "http://localhost/");
> +  let wasOffline = Services.io.offline;
> +  Services.io.offline = true;

I think maybe add a comment along the lines of:

// Turn off the network for loop tests, so that we don't
// try to access the remote servers. If we want to turn this
// back on in future, be careful to check for intermittent
// failures.
Attachment #8482759 - Flags: review?(standard8) → review+
Looks good on try, so let's give this another spin: https://hg.mozilla.org/integration/fx-team/rev/e6cf07180934
Okay, that broke xpcshell again. Backing out *again*: https://hg.mozilla.org/integration/fx-team/rev/2dd9b2f2ef8f
Curse the fact that I'm working on two bugs with extremely similar bug numbers. I think I got my push tests backwards - I pushed '139 with browser mochi tests, and '319 with xpcshell tests, which is backwards from what I wanted. Let's level-set. This is a try push of *this* bug with xpcshell tests. Given that it failed on fx-team, I expect it to fail on try. If not, we have bigger problems to deal with: https://tbpl.mozilla.org/?tree=Try&rev=e65618604015
And now, to be hyper-defensive after bouncing off the tree so many times, here's a full suite of tests run on try: https://tbpl.mozilla.org/?tree=Try&rev=79c6b5703bb8
Okay. That try was as green as tries ever get -- two known intermittent oranges and a really odd Android debug crash that is not plausibly related to my patch. I'm going to call that a clean bill of health; so, once more, with feeling: https://hg.mozilla.org/integration/fx-team/rev/c8dc98469fc0
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Comment on attachment 8484463 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server (as landed)

Carrying forward r+ from mhammond, Standard8

Approval Request Comment
[Feature/regressing bug #]: Loop
[User impact if declined]: Overall capacity of the Loop service would be constrained to whatever could be supported by one Simple Push cluster (~5 million users). After we reach that level, we would need to take steps to disable the Loop service for remaining users, resulting in partial service unavailability.
[Describe test coverage new/current, TBPL]: The changes made by this patch have been verified by running it against the http://loop-dev.stage.mozaws.net/ development server, which already has the server-side portion of this functionality implemented. Based on logging messages added to the patch (temporarily, during testing), the function works as planned.
[Risks and why]: The risks to taking this patch are very low for the system in general, as the changes are isolated to Loop's push server handling. The risk to the Loop server handling is also low; the maximum impact is that a failure induced by this change may prevent incoming Loop calls from arriving. This would be detected and reported to the user as an error.
[String/UUID change made/needed]: None
Attachment #8484463 - Flags: review+
Attachment #8484463 - Flags: approval-mozilla-aurora?
Comment on attachment 8484463 [details] [diff] [review]
Retrieve Simple Push Server URL from Loop Server (as landed)

From speaking with abr: This is a temporary solution until a new server is in deployment. Once the new server is deployed, we're going to want to pull out the logic that falls back to the default server.

Aurora+
Attachment #8484463 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
James, can you please have a look at this to determine what testing this needs?
Flags: qe-verify?
Flags: needinfo?(jbonacci)
:ashughes - I will see what I can do. I don't really have full context anymore. I will talk to the Loop-Server and SimplePush devs for more info about getting the SP (Prod) server info dynamically from Loop-Server Prod...

+ tarek
+ ben

Starting with "needinfo" but will probably just have an offline discussion...
Flags: needinfo?(tarek)
Flags: needinfo?(jbonacci)
Flags: needinfo?(bbangert)
:ashughes So, just to clarify some flags. How is this "fixed" for Fx35? What do we want to see for Fx34? services.push.serverURL is defined for Fx33 as well. What are our plans for this release?
Flags: needinfo?(anthony.s.hughes)
status-firefox35:fixed just means that the patch landed in mozilla-central during the Firefox 35 timeframe (comment 26). Aurora 34 uplift was requested (comment 28) and approved (comment 29), but the landing on mozilla-aurora has not yet happened. Based on that, my assumption is we want 35 and 34 to be the same here. 33 is a different question though; maybe Adam Roach can answer that?
Flags: needinfo?(anthony.s.hughes) → needinfo?(adam)
James: I think we have everything covered in our functional tests. I guess one interesting thing to do would be to add an assertion in the load tests to verify that we're really doing some load balancing and how effective the shuffling is (as long as we *do* have several Simple Push nodes in place).
Flags: needinfo?(tarek)
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #33)
> status-firefox35:fixed just means that the patch landed in mozilla-central
> during the Firefox 35 timeframe (comment 26). Request for Aurora 34 uplift
> was requested (comment 28) and approved (comment 29) but the landing on
> mozilla-aurora has not yet happened.
>
> Based on that my assumption is we want 35 and 34 to be the same here. 33 is
> a different question though, maybe Adam Roach can answer that?

Loop in Fx33 is only going to be active for beta -- it won't ride out into release, so we don't need this mechanism.

Loop in 34 is going to be in release, but only in the customization palette. We're going to want to manage load just in case we see a really big uptake of the feature despite its relative obscurity.

Loop in 35 (and later) will definitely need this mechanism, as we're going to need the ability to scale across multiple clusters. We will probably be landing a follow-up bug in the 35 timeframe that removes the "fallback to configured value", but we'll wait until the Loop server that supports this functionality is deployed first.
Flags: needinfo?(adam)
OK. Thanks for all the updates. So a lot of the initial work/test/verification needs to happen on the server side. We cannot do that until there is an SP Stage and an SP Prod environment. Then we can look at the SP <------> Loop-Server part of the pipeline. After that, services QA can work with client QA to make sure that Loop-Server returns what it is supposed to return for the client requests. Plus the various error handling at both ends and on/from all the services. Fun.
Flags: needinfo?(bbangert)
Flagging this for QE verification based on comment 36. James, please let me know when we're ready for client-side testing.
Flags: qe-verify? → qe-verify+
Whiteboard: [loop-uplift]
Hi James, What's the QE status here? I need to make sure this is verified fixed against Nightly and Fig in the next 24-48 hours.
Flags: needinfo?(jbonacci)
SP test and stage environments are in the process of being set up (https://bugzilla.mozilla.org/show_bug.cgi?id=1072495). In a discussion today with Ops, there are several cases the client will need to handle:

1) The client will need to change the SP cluster it's connected to if that cluster becomes 'full'. To do this it will need to re-query the cluster URL to locate a new cluster to connect to, and update the Loop server with its new endpoint URL.

2) For SP cluster updates, Ops may either do a rolling restart, or more likely set up a new SP cluster and change DNS to point to the replacement cluster. In this case, the client will need to drop its connection, query DNS for new IPs, and reconnect to the replacement cluster.

I'm unaware of exactly what the client currently does with the SP name it chooses from the supplied URL, or how we can instruct it that it must re-choose (as in case 1 above).
Richard, since James is on PTO can you have a look at this?
Flags: needinfo?(jbonacci) → needinfo?(rpappalardo)
Whiteboard: [loop-uplift] → [loop-uplift][fig:verifyme]
QA Contact: jbonacci → rpappalardo
I'm not yet clear on what's still needed here, but let me touch base with :kthiessen tomorrow & get back to you
As far as I understand it, verification of this bug depends on bug 1072495 being complete.
< ashughes> rpapa: in reading it further I think this may depend on bug 1072495 -- it may just need to wait until James is back
I'm marking this wontverify for Fig since we're still blocked here.
Whiteboard: [loop-uplift][fig:verifyme] → [loop-uplift][fig:wontverify]
Flags: needinfo?(jbonacci)
Please see:
https://bugzilla.mozilla.org/show_bug.cgi?id=1055139#c40
https://bugzilla.mozilla.org/show_bug.cgi?id=1055139#c43
https://bugzilla.mozilla.org/show_bug.cgi?id=1055139#c44

So, no. This cannot be verified until the SP Stage and Production environments are up and running. Clear whatever fields need to be cleared. We will get to this (both client and server teams) once the environments are done.
Flags: needinfo?(rpappalardo)
Flags: needinfo?(jbonacci)
Marking this as qe-verify- based on comment 47. Please reflag this + if/when this becomes verifiable.
Flags: qe-verify+ → qe-verify-
:ashughes I added some dependencies that will make this QA work easier to schedule...
Depends on: 1082188
:jbonacci - stage is up - can we test with that to verify this as fixed?
Yes, we can test this. I need some coordination with the client teams. I guess we need to make sure that the client (in some version of Fx) can now talk to the stack (Prod). Then we need Ops to verify that the communication is really happening.

We also need to verify the configs are changed for loop-client. Not sure if this is built into some Fx release already, or if we need to make an actual change to loop-client (deployed) or standalone.

Not sure we can generate enough traffic via the client to verify actual SP load balancing. (The Loop-Server load test does not hit SP, nor is it supposed to.) If the SP load test is high-traffic enough, that will be our "load balancing" verification step. Separately, any automated tests with the config set correctly should generate "enough" traffic to allow Ops to verify that the Prod ELB is being hit and individual SP instances are being used to handle that traffic.

Anything else I am missing? I highly recommend a meeting to get SP devs, Loop-Client devs, QA, and Ops around a table to get the quickest/shortest path to verification. Neither Richard nor I should be in the QA field for this bug.
QA Contact: rpappalardo → anthony.s.hughes
(In reply to James Bonacci [:jbonacci] from comment #51)
> I highly recommend a meeting to get SP Devs, Loop-Client Devs, QA, and OPs
> around a table to get the quickest/shortest path to verification.

Let's continue this discussion via email. I'll set up a meeting once I get a complete list of people who need to be there.
Flags: qe-verify- → qe-verify+
No longer depends on: 1082188
Cool. Thanks, :rpapa. Adding the Prod deploy as a dependency for this bug. That is what we really need.
Depends on: 1090504
Adam -- Can you read over Comment 51 (https://bugzilla.mozilla.org/show_bug.cgi?id=1055139#c51) and confirm that the loop client is ready for testing? I believe it is, but you're closer to this than I am. Thanks.
Flags: needinfo?(adam)
(In reply to James Bonacci [:jbonacci] from comment #51)
> Yes, we can test this. I need some coordination with the client teams.
> I guess we need to make sure that the client (in some version of Fx) can now
> talk to the stack (Prod).
> Then we need OPs to verify that the communication is really happening.
>
> We also need to verify the configs are changed for loop-client. Not sure if
> this is built into some Fx release already, or if we need to make an actual
> change to loop-client (deployed) or standalone.

Ready for testing in Nightly 36:
http://dxr.mozilla.org/mozilla-central/source/browser/components/loop/MozLoopPushHandler.jsm#290

Ready for testing in Aurora 35:
http://hg.mozilla.org/releases/mozilla-aurora/file/afef8d8afd37/browser/components/loop/MozLoopPushHandler.jsm#l227

Ready for testing in Beta 34:
http://hg.mozilla.org/releases/mozilla-beta/file/c94fc6b83daa/browser/components/loop/MozLoopPushHandler.jsm#l227

The change is not in Release 33. The standalone client does not use the SP server, so it does not need verification.

> Not sure we can generate enough traffic via client to verify actual SP load
> balancing.
> (the Loop-Server load test does not hit SP, nor is it supposed to)
> If the SP load test is high traffic enough, that will be our "load
> balancing" verification step.
> Separately, any automated tests with the config set correctly, should
> generate "enough" traffic to allow OPs to verify that the Prod ELB is being
> hit and individual SP instances are being used to handle that traffic.

So, I would test this using a couple of different techniques. First, an end-to-end test where you bring up several instances of Firefox and verify that they're not all connecting to the same Simple Push server. That will verify that the client and server are interacting with each other properly.
Then, I'd write a script that makes many (hundreds or thousands) of requests to https://loop.services.mozilla.com/push-server-config and records how many times each server URL is returned. Each server should appear approximately as many times as every other server. You can probably spot problems by hand; if you want to be rigorous or automate the tests, I would check that the standard deviation was below some reasonable threshold (e.g., stddev < 100 for 10,000 samples). Note that both of these tests assume that the Loop server has been configured to point at several simple push clusters rather than the single "wss://push.services.mozilla.com/" they appear to be configured with right now.
Flags: needinfo?(adam)
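The sampling check Adam describes above (tally the URL returned by /push-server-config over many requests, then compare the per-server counts and their standard deviation) could be sketched like this. The function and parameter names are illustrative; `getConfig` stands in for whatever performs the GET and extracts the push server URL from the response.

```javascript
// Sketch of the distribution check: sample the config endpoint `samples`
// times, count how often each push server URL is handed out, and compute
// the standard deviation of the counts. With fair N-way balancing, each
// count should be near samples / N and the stddev should be small
// (e.g., the suggested stddev < 100 for 10,000 samples).
async function tallyPushServers(getConfig, samples) {
  const counts = {};
  for (let i = 0; i < samples; i++) {
    const url = await getConfig();
    counts[url] = (counts[url] || 0) + 1;
  }
  const values = Object.values(counts);
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const stddev = Math.sqrt(
    values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length);
  return { counts, stddev };
}
```

As noted later in this bug, this is a statistical sample rather than a load test: the requests can be spread over hours without affecting the result.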
James, can you please advise how we can reliably test this given what Adam has said in comment 55?
Flags: needinfo?(jbonacci)
:ashughes we will need to coordinate with the SP Dev team and the OPs team. That is why I wanted them in the meeting. Also, I did not even know this endpoint existed: https://loop.services.mozilla.com/push-server-config.
Flags: needinfo?(jbonacci)
> Then, I'd write a script that makes many (hundreds or thousands) of requests to
> https://loop.services.mozilla.com/push-server-config and records how many times each server URL is
> returned. Each server should appear approximately as many times as every other server. You can probably
> spot problems by hand; if you want to be rigorous or automate the tests, I would check that the standard
> deviation was below some reasonable threshold (e.g., stddev < 100 for 10,000 samples).

This sounds like a load test, though doing something different than what push-tester does. Do we want to add an additional test for this?
:rpappa - I'd probably just curl the endpoint in a loop and record the value rather than make it more formal. I think the steps to verify this bug as fixed are:
1. Ops or dev wire up the load-balancing URL so it is configured to return 50%/50% (or some percentage split) between two URLs (possibly CNAMEs to the same server).
2. CldSvcs curls that endpoint and gets the percentage distribution over, say, 100 curls.
3. Client QA opens up Nightly with the Console and confirms each endpoint is represented at the expected percentage.
:rpapa - the loop-server load test does not hit the SP server or that endpoint, so normally an extra test would be needed, as :abr pointed out :-)
:edwong has it right - simpler is better
The last bit of my first sentence fell away during copy/paste apparently. :rpapa you are right and I am right - neither load test currently hits this endpoint...
Depends on: 1091100
To be clear, I'm not proposing this as a load test of the loop server -- the repeated check is to get a statistically significant number of samples of the URL it hands out for the loop server. It doesn't matter if you do them "best speed" or over the course of a few hours.
Nominating this to block Loop 35 based on today's SimplePush QA coordination meeting. Maire confirmed this is a nice-to-have for 34 given the softlaunch but a hard requirement for 35. Aiming for 35 will give Ops & QA the time we need to not cut corners in our testing and make sure we do this right.
backlog: --- → Fx35?
Depends on: 1091192
I'm not clear on what the configuration for the server should look like. What are the Simple Push endpoints we should use here?
Flags: needinfo?(kcambridge)
:alexis, the Loop server, or the client? The Loop server will store the endpoint the client hands it as usual; clients will hand the Loop server endpoints under different domains, so the entire endpoint URL will need to be held by the Loop server.
Yes, the `/push-server-config` exposed by the Loop server is fine. The client just needs to handle the "redirect" and "cluster full" replies sent by the Push server after the opening handshake. In case the cluster is full, the client will need to make another request to `/push-server-config` and obtain a push URL for the new cluster.
Flags: needinfo?(kcambridge)
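The control flow Kit describes above (on a "redirect" or "cluster full" reply after the handshake, discard the cached config, re-query `/push-server-config`, and reconnect to the new cluster) might look roughly like this. Note this is purely illustrative: this bug does not spell out the wire format of the Push server's post-handshake replies, so the message shape used here ({status: ...}) and all function names are assumptions, not the actual protocol or client code.

```javascript
// Hypothetical sketch of the re-query flow. `refetchConfig` stands in for a
// fresh (uncached) request to /push-server-config; `connect` stands in for
// opening a websocket to the given push server URL. The {status} message
// shape is an assumption for illustration only.
function makeHandshakeHandler(refetchConfig, connect) {
  return async function onHandshakeReply(reply) {
    if (reply.status === "ok") {
      return "connected";
    }
    if (reply.status === "cluster-full" || reply.status === "redirect") {
      const config = await refetchConfig(); // get a new cluster hostname
      await connect(config.pushServerURI);  // reconnect elsewhere
      return "reconnected:" + config.pushServerURI;
    }
    return "error";
  };
}
```

The key design point is that the cached `/push-server-config` response must be invalidated before reconnecting; otherwise the client would bounce back to the same full cluster.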
Iteration: --- → 35.1
(In reply to Kit Cambridge (:spindrift, :kcambridge) from comment #66)
> Yes, the `/push-server-config` exposed by the Loop server is fine. The
> client just needs to handle the "redirect" and "cluster full" replies sent
> by the Push server after the opening handshake. In case the cluster is full,
> the client will need to make another request to `/push-server-config` and
> obtain a push URL for the new cluster.

I'm a little confused here. The whole point of having the push server endpoint distributed by the Loop server was that the Loop server could hand out different values to different clients. It sounds like you're trying to do something different than that now?
(In reply to Adam Roach [:abr] from comment #67)
> I'm a little confused here. The whole point of having the push server
> endpoint distributed by the loop server was that the loop server could hand
> out different values to different clients.

Yep -- but the `/push-server-config` endpoint hands out the hostname of the Push cluster. As I understand it, when the client connects to that URL, it resolves the hostname to a random node in the cluster -- and caches the DNS lookup. If that Push node is overloaded, we need to redirect the client to a different one. We're worried that, if we just close the connection, the client will reach into its DNS cache and reconnect to the same overloaded node.

It also looks like the client is caching the response from `/push-server-config`:
https://github.com/mozilla/gecko-dev/blob/master/browser/components/loop/MozLoopPushHandler.jsm#L289-291

If the entire cluster becomes overloaded, the client will need to requery that endpoint for a new Push cluster hostname.
I thought we would return a different node domain name, not the same one all the time; otherwise there is almost no point in doing this. So, are you saying the round-robin is handled directly by DNS lookup? Also, how do we know which values should be configured on the Loop server side?
Flags: needinfo?(kcambridge)
There will be a variety of names returned, and they will be updated as clusters fill up. Ops will be supplying the names of the clusters to use.
Flags: needinfo?(kcambridge)
Ben, do you know what mechanism will be used to obtain the list of names? Also, when should we start to use them, and who in Ops is knowledgeable about that?
Flags: needinfo?(bbangert)
The current name is at the bottom of this ticket:
https://bugzilla.mozilla.org/show_bug.cgi?id=1090504

I believe Ops will be adding clusters; when they do, they will likely file a bug to ensure the names at this URL can be updated.
Flags: needinfo?(bbangert)
(In reply to Adam Roach [:abr] from comment #55)
> Then, I'd write a script that makes many (hundreds or thousands) of requests
> to https://loop.services.mozilla.com/push-server-config and records how many
> times each server URL is returned. Each server should appear approximately
> as many times as every other server. You can probably spot problems by hand;
> if you want to be rigorous or automate the tests, I would check that the
> standard deviation was below some reasonable threshold (e.g., stddev < 100
> for 10,000 samples).

Load balancing working on stage:

TRIES: 10000
RESPONSE TALLY: {'loop-push2.stage.mozaws.net': 4939, 'loop-push1.stage.mozaws.net': 5061}
RUNTIME: 7111.8 seconds
STANDARD DEVIATION: 86.3

Will re-verify when this gets deployed to prod with loop-push 1.5.0; see:
https://bugzilla.mozilla.org/show_bug.cgi?id=1104994#c11

Note on https://bugzilla.mozilla.org/show_bug.cgi?id=1104994#c9: We decided to go with only one push endpoint -- push2 currently exists in stage but not in prod.