1094587 - Please deploy server-syncstorage 1.5.11 to stage

Reporter

Description

•

11 years ago

This version of syncstorage tweaks some logging and adds a potential workaround for an apparent TokuDB bug:

Bug 1063284 - Log at higher level when killing interrupted sql commands
Bug 1057892 Comment 32 - workaround for TokuDB errors on PUT /meta/global

This will hopefully reduce the level of 503s due to Bug 1057892 to close to zero.

Please deploy to stage and do a loadtest run in preparation for production deployment. If possible let's prioritize this over the tokenserver deploy in Bug 1091313 so that we can get the workaround shipped.

Ryan Kelly [:rfkelly]

Reporter

Comment 1

•

11 years ago

Also worth noting, we should go back to the prod version of MariaDB for this test, not the updated version from Bug 1089945. Otherwise, if the errors do stop, we won't know what actually fixed them.

Bob Micheletto [:bobm]

Assignee

Updated

•

11 years ago

Assignee: nobody → bobm

Status: NEW → ASSIGNED

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 2

•

11 years ago

Do we have an ETA when this deployment will be done? Would be nice to know. Thanks.

Ryan Kelly [:rfkelly]

Reporter

Comment 3

•

11 years ago

ni? :bobm, but I think he's out for a conference this week.

Flags: needinfo?(bobm)

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 4

•

11 years ago

It's 8 days later and there is still no update. Is Bob really the only person who can deploy changes? We would really appreciate it if this could go live soon. Thanks.

Bob Micheletto [:bobm]

Assignee

Comment 5

•

11 years ago

This has been rolled to Stage.

Flags: needinfo?(bobm)

Ryan Kelly [:rfkelly]

Reporter

Comment 6

•

11 years ago

Sorry for the delay here Henrik, we're running light on QA resources at the moment so deploys are taking longer than usual to go through. Bob and I will see about doing a bit of our own smoketesting of this deploy in stage. Are you able to redirect the failing TPS tests to our stage server, to see if this change helps with the issues you are seeing? The stage sync endpoint is https://token.stage.mozaws.net/1.0/sync/1.5, and the stage FxA endpoints are listed at https://developer.mozilla.org/en-US/Firefox_Accounts#Stage

Flags: needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 7

•

11 years ago

So the only thing I would have to do is to set the PUBLIC_URL environment variable to https://token.stage.mozaws.net/1.0/sync/1.5 as instructed in the following comment? https://github.com/mozilla/fxa-python-client/issues/25#issuecomment-60857774

Flags: needinfo?(hskupin)

Bob Micheletto [:bobm]

Assignee

Updated

•

11 years ago

QA Contact: kthiessen

Ryan Kelly [:rfkelly]

Reporter

Comment 8

•

11 years ago

> So the only thing I would have to do is to set the PUBLIC_URL environment variable to
> https://token.stage.mozaws.net/1.0/sync/1.5 as instructed in the following comment?

No, this will change the FxA instance used by the tests, but not the sync instance. (And now that I think about it, using production FxA with stage sync should work just fine.)

How do your tests discover what sync server to use? Do they just use the default value built into Firefox?

Ryan Kelly [:rfkelly]

Reporter

Comment 9

•

11 years ago

(I mean specifically the failing tests in Bug 1066493)

Ryan Kelly [:rfkelly]

Reporter

Comment 10

•

11 years ago

Assuming it's actually done through Firefox, this would involve setting the about:config preference "services.sync.tokenServerURI" to https://token.stage.mozaws.net/1.0/sync/1.5
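
If the test harness builds its own Firefox profile, one way to apply that preference without touching about:config by hand is to write it into the profile's user.js file. The following is only a sketch: whether the TPS setup reads user.js from its profile is an assumption, and format_user_pref is a hypothetical helper, not part of any existing tool.

```python
import json

# Stage tokenserver endpoint quoted in this bug.
STAGE_TOKENSERVER = "https://token.stage.mozaws.net/1.0/sync/1.5"


def format_user_pref(name, value):
    """Return one user.js line that sets a Firefox preference.

    Hypothetical helper for illustration only.
    """
    # json.dumps produces double-quoted string literals, which are also
    # valid JavaScript string literals for plain ASCII values like these.
    return "user_pref({}, {});".format(json.dumps(name), json.dumps(value))


if __name__ == "__main__":
    # Append the printed line to user.js in the test profile directory.
    print(format_user_pref("services.sync.tokenServerURI", STAGE_TOKENSERVER))
```

Removing the line from user.js (and resetting the preference) would point the profile back at the default server.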

Ryan Kelly [:rfkelly]

Reporter

Comment 11

•

11 years ago

After a few hiccups getting loads to work, I have finally kicked off a basic tokenserver+sync loadtest against this stack: https://loads.services.mozilla.com/run/cfe5936f-187a-41bc-bade-f728cd8b0d01

Ryan Kelly [:rfkelly]

Reporter

Comment 12

•

11 years ago

Aaand it's giving a bunch of errors, so I killed it. Will dig in...

Ryan Kelly [:rfkelly]

Reporter

Comment 13

•

11 years ago

I'm getting DNS errors while running the loadtest, and it appears to be trying to connect to the following endpoint:

https:sync-1-us-east-1.stage.mozaws.net/1.5/17194

Which is malformed, should be "https://" rather than just "https:". Bob, can you please check the node definitions in the tokenserver database?

Bob Micheletto [:bobm]

Assignee

Comment 14

•

11 years ago

(In reply to Ryan Kelly [:rfkelly] from comment #13)
> I'm getting DNS errors while running the loadtest, and it appears to be
> trying to connect to the following endpoint:
>
> https:sync-1-us-east-1.stage.mozaws.net/1.5/17194
>
> Which is malformed, should be "https://" rather than just "https:". Bob, can
> you please check the node definitions in the tokenserver database?

It was just sync 1 for some reason. It has been fixed.

+-------------------------------------------+
| node                                      |
+-------------------------------------------+
| https://sync-0-us-east-1.stage.mozaws.net |
| https://sync-1-us-east-1.stage.mozaws.net |
| https://sync-2-us-east-1.stage.mozaws.net |
| https://sync-3-us-east-1.stage.mozaws.net |
| https://sync-4-us-east-1.stage.mozaws.net |
+-------------------------------------------+
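
A malformed node entry like the one reported in comment 13 could be caught with a quick check over the node list before a loadtest run. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the exact value that was stored for sync-1 is an assumption based on the endpoint seen in the DNS errors.

```python
from urllib.parse import urlsplit


def is_valid_node(url):
    """True if the node URL has an https scheme and a non-empty host part."""
    parts = urlsplit(url)
    # "https:host" parses with scheme "https" but an empty netloc, because
    # the "//" that introduces the host is missing.
    return parts.scheme == "https" and bool(parts.netloc)


def find_malformed(nodes):
    """Return the node URLs that fail the check, in their original order."""
    return [node for node in nodes if not is_valid_node(node)]


if __name__ == "__main__":
    nodes = [
        "https://sync-0-us-east-1.stage.mozaws.net",
        # Assumed shape of the bad entry; the real stored value is not shown.
        "https:sync-1-us-east-1.stage.mozaws.net",
        "https://sync-2-us-east-1.stage.mozaws.net",
    ]
    for node in find_malformed(nodes):
        print("malformed node:", node)
```

The check is deliberately narrow: it only confirms the scheme and the presence of a host, not that the host resolves.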

Ryan Kelly [:rfkelly]

Reporter

Comment 15

•

11 years ago

Thanks Bob, local tests look healthier, I've kicked off a fresh loads run here: https://loads.services.mozilla.com/run/dc12e2d5-43fb-444b-95e8-d39b31038e14

Ryan Kelly [:rfkelly]

Reporter

Comment 16

•

11 years ago

Loadtest shows a number of 503s from the tokenserver, but no errors from the sync nodes. I'll try an isolated loadtest run against sync-1 for comparison; see https://loads.services.mozilla.com/run/439750b5-4d14-4ded-8907-26b4ea6678e6

Ryan Kelly [:rfkelly]

Reporter

Comment 17

•

11 years ago

The single-node test looks good, zero errors and solidly above 350 RPS. Errors in the combined test are most likely due to something on the tokenserver side, which we can dive back into in QA for Bug 1091313.

Ryan Kelly [:rfkelly]

Reporter

Comment 18

•

11 years ago

Marking this fixed, after discussion with Bob and Karl in IRC today we're happy to roll this to production.

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 19

•

11 years ago

Hi Ryan. Sorry that I haven't had the time over the last few days to check this change with our CI. It's a bit complicated to get the staging sync server to be used. I wonder if I should still do a check, or if I should have a look at our tests once the change on this bug has been deployed to production?

Flags: needinfo?(rfkelly)

Karl Thiessen [:kthiessen, he/him]

Comment 20

•

11 years ago

(In reply to Ryan Kelly [:rfkelly] from comment #18)
> Marking this fixed, after discussion with Bob and Karl in IRC today we're
> happy to roll this to production.

In most other projects I'm involved in here, this means that I should now file the production deploy ticket. Is that true here as well, or is there a different process to follow?

Ryan Kelly [:rfkelly]

Reporter

Comment 21

•

11 years ago

No worries Henrik, at this stage it sounds like it'll be easier to just wait for the production deploy. Hopefully it will actually have the desired effect. Karl, yes, please mark this as RESOLVED/VERIFIED and file a follow-up production bug.

Flags: needinfo?(rfkelly)

Karl Thiessen [:kthiessen, he/him]

Updated

•

11 years ago

Blocks: 1105529

Karl Thiessen [:kthiessen, he/him]

Comment 22

•

11 years ago

Verified. Production ticket is bug 1105529.

Status: RESOLVED → VERIFIED