Closed Bug 1589565 Opened 6 years ago Closed 6 years ago

Please deploy tokenserver 1.4.4

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rfkelly, Assigned: jrgm)

References

Details

Attachments

(1 file)

As part of our effort to accurately report and track the timestamp at which a users keys were rotated, this release of tokenserver loosens some restrictions on the data that must be provided by OAuth sync clients. It's step (1) of 3 in the above-linked github issue.

To effectively QA this change, we should confirm that Desktop and the Lockwise mobile app can sync successfully with the user's account, both before and after changing the account password.

Cross-linking for context, we need to have this stable in production before merging https://github.com/mozilla/fxa/pull/2570 on the FxA side of things.

Whoops, I somehow managed to merge the intended commit into a different branch rather than into master, stand by for a v1.4.4 to fix it...

Flags: needinfo?(rfkelly)

tokenserver v1.4.4 is up on docker hub now

Flags: needinfo?(rfkelly)
Summary: Please deploy tokenserver 1.4.3 → Please deploy tokenserver 1.4.4
Assignee: nobody → jrgm

Tokenserver v1.4.5 (note extra version bump to pick up a startup bug) is in stage. I've run a small loadtest on it, and that appears normal. Is there QA assigned to test the points in comment #0?

$ curl -s https://token.stage.mozaws.net/__version__ | python -mjson.tool
{
"build": "https://circleci.com/gh/mozilla-services/tokenserver/268",
"commit": "8d0df2b26a4a7bd3d47fa36284ee26aef9c39de2",
"source": "https://github.com/mozilla-services/tokenserver",
"version": "1.4.5"
}

$ curl -s https://token.stage.mozaws.net/ | python -mjson.tool
{
"auth": "https://token.stage.mozaws.net",
"browserid": {
"allowed_issuers": [
"api.accounts.firefox.com",
"api-accounts.stage.mozaws.net",
"api-accounts-dev.stage.mozaws.net",
"api-accounts.dev.lcip.org",
"api-accounts-latest.dev.lcip.org",
"nightly.dev.lcip.org",
"latest.dev.lcip.org",
"mockmyid.com",
"mockmyid.s3-us-west-2.amazonaws.com"
],
"trusted_issuers": [
"api.accounts.firefox.com",
"api-accounts.stage.mozaws.net",
"api-accounts-dev.stage.mozaws.net",
"api-accounts.dev.lcip.org",
"api-accounts-latest.dev.lcip.org",
"nightly.dev.lcip.org",
"latest.dev.lcip.org"
]
},
"oauth": {
"default_issuer": "api-accounts.stage.mozaws.net",
"scope": "https://identity.mozilla.com/apps/oldsync",
"server_url": "https://mock-oauth-stage.dev.lcip.org/v1/"
},
"services": {
"sync": [
"1.5"
]
}
}
$

Is there QA assigned to test the points in comment #0?

I'm a bit out of the loop on tokenserver QA these days; Rebecca, is this still something you can help us wrangle?

Flags: needinfo?(rbillings)

(In reply to Ryan Kelly [:rfkelly] from comment #5)

Is there QA assigned to test the points in comment #0?

I'm a bit out of the loop on tokenserver QA these days; Rebecca, is this still something you can help us wrangle?

Rebecca's on PTO.

Peter, can you help with this?

Flags: needinfo?(rbillings) → needinfo?(pdehaan)

Hi, I can take this bug for verification. Please let me know what settings (prefs) should be made for testing on stage environment.

Flags: needinfo?(pdehaan)

Thanks, Vasilica.

To test:

  1. For Desktop - Follow these instructions for testing against staging.

  2. For Lockwise iOS and Android - at the moment you'll need to do a custom local build of each one to test, with the staging sync url hardcoded into the builds. We're putting together more clear steps for you on how to do this, which we'll get to you shortly. In the meantime however, you should be able to start testing Desktop. Since they use the same components, we're comfortable moving forward testing one of these - not necessarily both.

Tested around Lockwise feature on Firefox 70.0.1 between Windows 10 64-bit and Mac OS X 10.14.6 profiles and all the passwords were successfully synced, both before and after changing the password (verified using two methods: changed from FxA settings and via forgot password flow using the recovery key)

Waiting for further instructions for android and ios, in order to verify also across this platforms.

Hi Vasilica,

Alrighty, so we've got some updated instructions for you to test iOS against staging. That should be enough to confirm we're good to go with this release, since both iOS and Android use the same shared rust components here.

And the good news here is you don't need to do any custom local builds. If you refer back to this document, you'll find there's updated instructions under "Using staging on ios". Follow those instructions to point firefox iOS to staging, and we should be good to go here.

Let me know if you need anything else on our end, and thanks so much!

Adding ni? for rfk at his request: question is, is running with a canary of 1.4.5 and the rest on lower version a compatible mode, or is it preferable to not remain in "dual version" for very long?

Flags: needinfo?(rfkelly)

is running with a canary of 1.4.5 and the rest on lower version a compatible mode, or is it preferable to not remain in "dual version" for very long?

Thank :jrgm; I expect it will be OK do run in "dual mode". The main change here removes some consistency checks around generation numbers, because we expect that OAuth clients will soon start failing those checks. But that won't happen until we ship https://github.com/mozilla/fxa/pull/2570 to production in FxA, so I don't think there's any risk to have some tokenserver servers with the checks and some without in the meantime.

Flags: needinfo?(rfkelly)
Attached file sync-error.txt

Seems that the password syncing does not work correctly when running the following steps:

  • Make some changes on desktop
  • Sync on desktop
  • Change password on desktop (The user is not automatically logged out from fennec and there is no notification displayed)
  • Go to Manage account on iOS - > relogin
  • Sync on iOS

Actual results: passwords do not match

Notes: the passwords are synced after a log out/login on fennec + sync on iOS

I’ve also received a sync-error but I’m not sure if it is related to the above scenario.

Tested between Windows 10 64-bit and iOS 13.1.2.

Adding ni? for rfk here; Ryan, I can take a look too to try and confirm this is also an issue later today if that'd be helpful...

Vasilica; are you seeing this for passwords only, or all sync data between iOS and Desktop? And, just to confirm; could you paste your about:sync-log here as well to help us investigate this one?

Flags: needinfo?(rfkelly)

Vasilica,

A few more clarifying questions for you here:

  1. When you say, "change password on desktop", are you talking about your FxA password?
  2. When you're testing against iOS, can you confirm you've followed the steps to point iOS to staging?
Flags: needinfo?(vasilica.mihasca)

Ok, so I think I figured out what's going on here.

Some of the sync engines in mobile (the newer ones, using the Rust components) aren't wired up to allow pointing to stage. We at least still do have a few older sync engines ("history" being the one I've confirmed so far) that does allow overwriting the server in iOS to point to stage. So...what I think needs to be tested here (:rfk, would love your input as well) is that you can Sync across Desktop <-> say, iOS using say, History.

I've confirmed that syncing History DOES work against staging in iOS after changing FxA password (while say "bookmarks" do not, which would be explained by the fact that they're using Rust components in iOS).

(In reply to Rachel Tublitz [:tublitzed] from comment #16)

I've confirmed that syncing History DOES work against staging in iOS after changing FxA password (while say "bookmarks" do not, which would be explained by the fact that they're using Rust components in iOS).

I think this observation is correct. The code only changes the endpoints for those engines that are written in Swift and not for those in Rust. It is unclear if that was something we missed or if the Rust code is missing the option to set different endpoints. Looping in :garvan to take a peek at the code for a final conclusion.

How urgent is this?

Flags: needinfo?(gkeeley)

How urgent is this?

I don't think we should block this bug on fixing an issue in Firefox for iOS; leaving ni? myself to suggest an alternate QA path for the behaviour in question (and to try it to confirm that it works!)

In Firefox iOS, this is where the staging server URLs are specified
https://github.com/mozilla-mobile/firefox-ios/blob/master/Account/FirefoxAccountConfiguration.swift#L30

:garvan did a bit of testing on Firefox for iOS and it appears that it can in fact successfully talk to the stage tokenserver, so I'm not sure what's going on here and will need to dig in further.

Some of the sync engines in mobile (the newer ones, using the Rust components) aren't wired up to allow pointing to stage

To add some more detail here: the rust sync-and-storage components themselves are capable of talking to stage or to self-hosted servers, but they have to be told explicitly to do so by the app. Most of the consumers of these components currently don't expose any facility for doing that, but it looks like Firefox for iOS does.

That said, it appears both Rachel and Garvan tried to sync bookmarks from Firefox iOS to stage, and the sync failed, so there's something going on here. I'll see if I can repro myself.

Flags: needinfo?(gkeeley)

So I'm holding on proceeding until some of the issues noted above are resolved or determined to be ok to proceed with. Let me know how things look, or if there is something you need from me in the stage environment.

We can look at deploying tomorrow morning if it's all ready to go.

I was able to configure our command-line logins sync demo client to talk to stage, and using that, was able to successfully sync logins between a rust-components client and desktop Firefox. I'm going to attach a screengrab of the session for completeness.

Everything synced correctly and demonstrated the behaviour I was expecting. In particular, I was able to manually edit the credentials being used by the rust-components client in order to give it an incorrect generation number, and tokenserver allowed it to proceed through to sync. This is the intended effect of the change being deployed here, and if I try the same trick on the production servers, tokenserver rejects it.

So I'm happy that functionality is working as expected here.

However, during my test session I did manage to trigger a single 503 Service Unavailabe from the tokenserver. It might have been a database blip, or it might have been some unhandled edge-case in the server code. :jrgm, could you please check the server logs for any corresponding traceback, at around 5 or 10 minutes prior to the timestamp of this comment?

Flags: needinfo?(rfkelly) → needinfo?(jrgm)

Here's a recording of what I did for completeness: https://drive.google.com/open?id=1fXZUZnNncg8GKmNH42-53-1XVDExocK1

(In reply to Rachel Tublitz [:tublitzed] from comment #14)

Vasilica; are you seeing this for passwords only, or all sync data between iOS and Desktop?

This happens for all sync data and it seems more like a FxA issue. I also filed an issue in GitHub ( https://github.com/mozilla/fxa/issues/3283 ) for more investigations.

And, just to confirm; could you paste your about:sync-log here as well to help us investigate this one?

I already attached the about:sync-log file is Comment #13

When you say, "change password on desktop", are you talking about your FxA password?

Yes. The FxA password changed from Manage Account page

When you're testing against iOS, can you confirm you've followed the steps to point iOS to staging?

The environment was set following the instructions on "Using staging on ios" paragraph from https://docs.google.com/document/d/1KnFdE1ee-yv8En5QaPoRj3x2s9kTESoW0lGNh2LwIrI/edit#heading=h.q75o2k4vzjg

Flags: needinfo?(vasilica.mihasca)

The 503 details are from a failure on a request to https://mock-oauth-stage.dev.lcip.org

Nov 08 07:12:47 ip-172-31-44-212.ec2.internal docker-tokenserver[6010]: {"error": "ConnectionError(u"Verification request to https://mock-oauth-stage.dev.lcip.org/v1/ failed; reason: ('Connection aborted.', error(104, 'Connection reset by peer'))",)", "traceback": "Uncaught exception:
File "/app/tokenserver/views.py", line 202, in _validate_oauth_token
token = verifier.verify(token)
File "/app/tokenserver/verifiers.py", line 231, in verify
raise ConnectionError(msg)
<class 'browserid.errors.ConnectionError'>
ConnectionError(u"Verification request to https://mock-oauth-stage.dev.lcip.org/v1/ failed; reason: ('Connection aborted.', error(104, 'Connection reset by peer'))",)
", "time": "2019-11-08T07:12:47.550036Z", "v": 1, "message": "Unexpected verification error", "hostname": "ip-172-31-44-212.ec2.internal", "pid": 15, "op": "tokenserver", "name": "tokenserver"}

Nov 08 07:12:47 ip-172-31-44-212.ec2.internal docker-tokenserver[6010]: {"token.oauth.verify_failure": 1, "code": 503, "v": 1, "pid": 15, "agent": "reqwest/0.9.22", "token.oauth.connection_error": 1, "path": "https://token.stage.mozaws.net/1.0/sync/1.5", "tokenserver.oauth.verify": 0.00575709342956543, "name": "mozsvc.metrics", "request_time": 0.0065190792083740234, "remoteAddressChain": ["121.200.7.201", "172.31.42.210", "127.0.0.1"], "method": "GET", "time": "2019-11-08T07:12:47.550758Z", "hostname": "ip-172-31-44-212.ec2.internal", "op": "mozsvc.metrics"}

Flags: needinfo?(jrgm)

So, I can't explain why but the above was a network error in invoking the api gateway for that endpoint. I don't see anything in the metrics for that api gateway or in the lambda handler that indicates it was having a problem. I can't explain why there'd be a network level error in this case. But I'm thinking this transient error is not a blocker to what rfk was demonstrating.

But I'm thinking this transient error is not a blocker to what rfk was demonstrating.

I agree, this seems like a spurious error unrelated to the change in question.

After chatting with some folks, this deploy to production will happen Monday morning PST.

Sounds good, thanks :jrgm!

I already attached the about:sync-log file is Comment #13

The file attached in Comment 13 appears to be a copy of services/sync/modules/policies.js from mozilla-central rather than a sync error log - Vasilica any chance you still have the profile and are able to double-check whether this is indeed what was in the sync error logs for that run?

Flags: needinfo?(vasilica.mihasca)

Stefan, since Vasi is out PTO today, any chance you can get us the sync-log here?

Flags: needinfo?(stefan.deiac)

This was put in production this afternoon about 1530 PST. Note: that's with a purge task disabled, and I'll have to work out how to get that caught up but the main functionality is in place.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

thanks, John

Thanks for covering QA Ica

Depends on: 1596378

(In reply to Ryan Kelly [:rfkelly] from comment #20)

That said, it appears both Rachel and Garvan tried to sync bookmarks from Firefox iOS to stage, and the sync failed, so there's something going on here. I'll see if I can repro myself.

I tested the bookmarks using the spanner account in the production version of FF for iOS and I can't sync any bookmarks. It seems that the iOS bookmarks are not synced with other OS's and vice-versa.
Is there any ticket about this issue or should I open a new one?
Notes:

  • History and Passwords are working properly.
Flags: needinfo?(vasilica.mihasca)
Flags: needinfo?(stefan.deiac)
Flags: needinfo?(rfkelly)

Is there any ticket about this issue or should I open a new one?

This is consistent with what Garvan observed above as well. Please do file a separate issue for us to dig in to this more. Thanks!

Flags: needinfo?(rfkelly)
Depends on: 1597192
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: