Closed Bug 1251307 Opened 9 years ago Closed 9 years ago

Please deploy loop-server 0.19.3 to PRODUCTION

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: cloud-services-qa, Assigned: jschneider)

References

Details

Tentatively scheduled for release on Tues., 3/1 @: 1pm PST / 4pm EST
Assignee: nobody → jschneider
Status: NEW → ASSIGNED
Depends on: 1248000
QA Contact: chartjes
re-scheduled for today @ 10am PST / 1pm EST
============================
PRE-DEPLOYMENT:
============================

Here's what's currently in production:

Placed several calls successfully between Nightly (47.0a1) and GR (44.0.2) using production loop-server 0.18.2 stack.

----------------------------
E2E TESTS
----------------------------
TESTS
messaging - OK
Tab & window-sharing - OK
Video/audio mute/unmute - OK
Room notifications - OK
end-2-end test calls - OK

----------------------------
URL CHECKS (PROD)
----------------------------

curl https://loop.services.mozilla.com | python -m json.tool
{
    "description": "The Mozilla Loop (WebRTC App) server",
    "endpoint": "https://loop.services.mozilla.com",
    "fakeTokBox": false,
    "fxaOAuth": true,
    "homepage": "https://github.com/mozilla-services/loop-server/",
    "i18n": {
        "defaultLang": "en-US"
    },
    "name": "mozilla-loop-server",
    "version": "0.18.2"
}

curl https://loop.services.mozilla.com/__heartbeat__ | python -m json.tool
{
    "fxaVerifier": true,
    "provider": true,
    "storage": true
}

curl https://loop.services.mozilla.com/push-server-config | python -m json.tool
{
    "pushServerURI": "wss://push.services.mozilla.com"
}
New stack:
"ELBDNSName": "loopsvrprod1-l-ELB-573C03V9CF29-1020574114.us-west-2.elb.amazonaws.com",
"ELBFQDN": "loopsvrprod1-l-ELB-573C03V9CF29-1020574114.us-west-2.elb.amazonaws.com"
Old stack: dualstack.loopsvrprod1-l-elb-14rdb0b303rer-691625340.us-west-2.elb.amazonaws.com.
============================================
PRE-PRODUCTION (INCOMING) STACK VERIFICATION
============================================

E2E tests and stack check okay.

Heartbeat check was showing intermittent errors. As per conversation with :bobm and :jp, ops to follow up with Tarek's team to modify heartbeat check for push.

Halting this release.

Notice from Sentry:

Regression on Loop-Server loopserver-prod

Error: Heartbeat: {"storage":true,"provider":true,"push":false,"fxaVerifier":true}

Tags
level = error
logger = root
server_name = ip-172-31-34-117

Exception
Error: Heartbeat: {"storage":true,"provider":true,"push":false,"fxaVerifier":true}
  File "/data/loop-server/loop/routes/home.js", line 59, in returnStatus
    logError(new Error("Heartbeat: " + JSON.stringify(data)));
  File "/data/loop-server/loop/routes/home.js", line 82, in null.<anonymous>
    returnStatus(storageStatus, tokboxError, pushStatus, verifierStatus);
  File "/data/loop-server/loop/routes/home.js", line 30, in Request._callback
    if (error) return callback(error);
... (6 additional frame(s) were not displayed)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Consider adding a load balancer specific health check to the service. See: https://bugzilla.mozilla.org/show_bug.cgi?id=1246008 for a similar request.
(In reply to Bob Micheletto [:bobm] from comment #7)
> Consider adding a load balancer specific health check to the service. See:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1246008 for a similar request.

Thanks, Bob. Per our vidyo this morning, we'll need to have a reliable __heartbeat__ check in place for push before we can re-deploy to PROD.

:natim, :tarek, is this something you guys might be able to add to heartbeat check?
Flags: needinfo?(tarek)
Flags: needinfo?(rhubscher)
Can we release this one without that change and add it to the next release which will happen during this week with new features?
Flags: needinfo?(rhubscher)
(In reply to Rémy Hubscher (:natim) from comment #9)
> Can we release this one without that change and add it to the next release
> which will happen during this week with new features?

It's a good point, as I believe push was just going to be added to the heartbeat w/ this release so, in theory, we're not breaking anything that was working before. Though we would have to choose to ignore the push heartbeat until the next release.

But I defer to Ops for this one. :jp, :bobm?
Flags: needinfo?(jschneider)
Flags: needinfo?(bobm)
We could also increase the timeout value on the push heartbeat call.
The default value of the heartbeatTimeout config is 2000ms; we may want to wait one more second before reporting the push endpoint as broken.
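To illustrate why a slow but healthy push server can show up as "push": false, here is a minimal Python sketch of a dependency check bounded by a timeout. It is not loop-server's code (loop-server is a Node.js service): `check_with_timeout` and `slow_push_check` are hypothetical names invented for this example, and only the 2000ms and 3000ms values come from this thread.

```python
import concurrent.futures
import time


def check_with_timeout(check, timeout_ms):
    """Run check() and return its result, or False if it exceeds timeout_ms."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        return bool(future.result(timeout=timeout_ms / 1000.0))
    except concurrent.futures.TimeoutError:
        # The dependency may be fine, just slower than the allowed window.
        return False
    except Exception:
        # Any error raised by the check itself also counts as unhealthy.
        return False
    finally:
        # Do not block the caller on a check that is still running.
        executor.shutdown(wait=False)


def slow_push_check(delay_s):
    """Hypothetical stand-in for a push server that answers after delay_s."""
    def check():
        time.sleep(delay_s)
        return True
    return check
```

Under these assumptions, a push server that answers in 2.5 seconds would be reported as down with `check_with_timeout(slow_push_check(2.5), 2000)` and as up with a 3000ms window.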
(In reply to Richard Pappalardo [:rpapa][:rpappalardo] from comment #10)
> It's a good point as I believe push was just going to be added to the
> heartbeat w/ this release so, in theory, we're not breaking anything that
> was working before. Though we would have to choose to ignore the push
> heartbeat til the next release.
>
> but I defer to Ops for this one :jp, :bobm?

I defer to jp!
Flags: needinfo?(bobm)
While I don't love having TCP healthchecks on load balancers, it's how we're currently running, so I won't block on it. As a heads up, until we make an lbheartbeat endpoint which doesn't exercise resource dependencies to give a 200 OK, we run the risk of having unhealthy nodes in our load balancer.
I fished this bit out of our documentation in mana:

"/__heartbeat__
Should return a 200 if the service is healthy, and a 500 otherwise. This should check dependent services like the database connection to ensure that they are healthy

/__lbheartbeat__
Should respond 200 if the service is up, 500 otherwise. This is for load balancer checks and should not check dependent services."

Right there we all get what we want. :)
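The split described in that mana excerpt can be sketched in a few lines of Python. This is only an illustration of the documented contract, not loop-server's implementation; the function names and the `route` helper are hypothetical, and the sample payload is the one from the Sentry notice earlier in this bug.

```python
def heartbeat_status(dependency_checks):
    """Status for /__heartbeat__: 200 only if every dependency is healthy."""
    return 200 if all(dependency_checks.values()) else 500


def lbheartbeat_status():
    """Status for /__lbheartbeat__: 200 whenever the process can answer."""
    return 200


def route(path, dependency_checks):
    """Hypothetical dispatcher mapping a request path to a status code."""
    if path == "/__lbheartbeat__":
        # Deliberately ignores dependency_checks.
        return lbheartbeat_status()
    if path == "/__heartbeat__":
        return heartbeat_status(dependency_checks)
    return 404


# Payload reported by Sentry in this bug: push is flagged as down.
sentry_payload = {
    "storage": True,
    "provider": True,
    "push": False,
    "fxaVerifier": True,
}
```

With that payload, `route("/__heartbeat__", sentry_payload)` yields 500 while `route("/__lbheartbeat__", sentry_payload)` still yields 200, so a flaky dependency check would not pull a node out of the load balancer.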
Flags: needinfo?(jschneider)
OK, sounds good. :jp, let's try to sync up w/ :chartjes tomorrow and make a plan.
Flags: needinfo?(tarek)
Ops has given OK to move forward with 0.19.3 so re-opening ticket.

Deployment will be tomorrow, Thurs. 3/3 @: 9am PST / 11am CST / 12pm EST

:jp to follow-up w/ :natim to modify heartbeat prior to next deploy
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
I've built the new stack, and it's ready at
"ELBDNSName": "loopsvrprod1-l-ELB-LG8XPE596N1V-2051374085.us-west-2.elb.amazonaws.com",
"ELBFQDN": "loopsvrprod1-l-ELB-LG8XPE596N1V-2051374085.us-west-2.elb.amazonaws.com"
The __lbhealthcheck__ feature for loop-server is getting implemented in Bug 1253257
JP Schneider, it is strange that the version displayed on this instance [0] is set to 0.20.0-dev.

[0] https://loopsvrprod1-l-elb-lg8xpe596n1v-2051374085.us-west-2.elb.amazonaws.com/
Apparently it is an error in my tag: https://github.com/mozilla-services/loop-server/blob/0.19.3/package.json#L4

I am going to fix that in a 0.19.4 release if that is ok for you.
============================
PRE-DEPLOYMENT:
============================

Here's what's currently in pre-production:

Placed several calls successfully between Nightly (47.0a1) and GR (44.0.2) using pre-production loop-server 0.19.3 stack.

----------------------------
E2E TESTS
----------------------------
TESTS
messaging - OK
Tab & window-sharing - OK
Video/audio mute/unmute - OK
Room notifications - OK
end-2-end test calls - OK

----------------------------
URL CHECKS (PRE-PRODUCTION)
----------------------------

It's a known issue that the devs accidentally tagged this release as 0.20.0-dev. It is 0.19.3 that is up in pre-production.

curl -k https://loop.services.mozilla.com | python -m json.tool
{
    "description": "The Mozilla Loop (WebRTC App) server",
    "endpoint": "https://loop.services.mozilla.com",
    "fakeTokBox": false,
    "fxaOAuth": true,
    "homepage": "https://github.com/mozilla-services/loop-server/",
    "i18n": {
        "defaultLang": "en-US"
    },
    "name": "mozilla-loop-server",
    "version": "0.20.0-dev"
}

NOTE: Known issue that heartbeat is intermittently reporting that push is down when in fact the system is working correctly. :natim indicated a fix is in the works

~ ᐅ curl -k https://loop.services.mozilla.com/__heartbeat__ | python -m json.tool
{
    "fxaVerifier": true,
    "provider": true,
    "push": true,
    "storage": true
}

~ ᐅ curl -k https://loop.services.mozilla.com/__heartbeat__ | python -m json.tool
{
    "fxaVerifier": true,
    "provider": true,
    "push": false,
    "storage": true
}

~ ᐅ curl -k https://loop.services.mozilla.com/push-server-config | python -m json.tool
{
    "pushServerURI": "wss://push.services.mozilla.com"
}

QA approved. Ready for DNS switch to production at scheduled deployment time of 12:00 Eastern Time.
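Because the push flag flips between runs, a single curl is not very informative. A rough way to quantify the flakiness is to sample the heartbeat several times and count the failures. The Python sketch below is an assumption-laden illustration: `fetch_heartbeat` and `count_push_failures` are hypothetical helpers, not part of loop-server or any QA tooling mentioned here, and unlike `curl -k` the fetcher does not skip certificate verification.

```python
import json
import urllib.error
import urllib.request


def fetch_heartbeat(url):
    """Fetch and decode a heartbeat JSON document (error statuses included)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read()
    except urllib.error.HTTPError as error:
        # An unhealthy heartbeat may arrive with an error status and a JSON body.
        body = error.read()
    return json.loads(body.decode("utf-8"))


def count_push_failures(fetch, samples):
    """Call fetch() `samples` times; return how many reports had push false."""
    failures = 0
    for _ in range(samples):
        # A missing "push" key is treated as a failure in this sketch.
        if not fetch().get("push", False):
            failures += 1
    return failures
```

For example, `count_push_failures(lambda: fetch_heartbeat("https://loop.services.mozilla.com/__heartbeat__"), 20)` would report how many of 20 consecutive checks saw push as down. The fetcher is passed in as a callable so the counting logic can be exercised without network access.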
Status: REOPENED → ASSIGNED
New stack:
"ELBDNSName": "loopsvrprod1-l-ELB-LG8XPE596N1V-2051374085.us-west-2.elb.amazonaws.com",
"ELBFQDN": "loopsvrprod1-l-ELB-LG8XPE596N1V-2051374085.us-west-2.elb.amazonaws.com"

We've got stackdriver constantly alerting due to the new heartbeat issue we know about. I'm going to disable that check for now.
Please consider changing heartbeatTimeout rather than deactivating the healthcheck. It is a loop-server configuration value that seems to be too low to give pushServer time to answer.

We can configure:

heartbeatTimeout: 3000
> QA approved. Ready for DNS switch to production at scheduled deployment time of 12:00 Eastern Time.

I would rather not switch to production with the wrong version number displayed.
Met w/ Dev/Ops/QA on vidyo and decided to move ahead with 0.19.3 since we have a 0.20.0 tag (to be deployed next Thurs.)
Original IPs:
52.24.142.188
52.88.50.37
54.68.81.130

Original Stack: loopsvrprod1-l-elb-14rdb0b303rer-691625340.us-west-2.elb.amazonaws.com.

Switching to new stack: loopsvrprod1-l-ELB-LG8XPE596N1V-2051374085.us-west-2.elb.amazonaws.com

Switched at 17:10:00 UTC
============================
PRODUCTION:
============================

Placed several calls successfully between Nightly (47.0a1) and GR (44.0.2) using production loop-server 0.19.3 stack.

----------------------------
E2E TESTS
----------------------------
TESTS
messaging - OK
Tab & window-sharing - OK
Video/audio mute/unmute - OK
Room notifications - OK
end-2-end test calls - OK

----------------------------
URL CHECKS (PROD)
----------------------------

It's a known issue that the devs accidentally labelled this release as 0.20.0-dev. It is 0.19.3 that is up in production

~ ᐅ curl https://loop.services.mozilla.com | python -m json.tool
{
    "description": "The Mozilla Loop (WebRTC App) server",
    "endpoint": "https://loop.services.mozilla.com",
    "fakeTokBox": false,
    "fxaOAuth": true,
    "homepage": "https://github.com/mozilla-services/loop-server/",
    "i18n": {
        "defaultLang": "en-US"
    },
    "name": "mozilla-loop-server",
    "version": "0.20.0-dev"
}

NOTE: Known issue that heartbeat is intermittently reporting that push is down when in fact the system is working correctly. :natim has suggested a fix in the timeout length on the push server to fix it.

~ ᐅ curl https://loop.services.mozilla.com/__heartbeat__ | python -m json.tool
{
    "fxaVerifier": true,
    "provider": true,
    "push": true,
    "storage": true
}

~ ᐅ curl https://loop.services.mozilla.com/push-server-config | python -m json.tool
{
    "pushServerURI": "wss://push.services.mozilla.com"
}

QA approved.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Forgot to move verified. One grumpy thumbs up.
Status: RESOLVED → VERIFIED