Expired TLS certificates on sync storage nodes in stage
Categories
(Cloud Services Graveyard :: Server: Sync, defect)
Tracking
(Not tracked)
People
(Reporter: rfkelly, Assigned: bobm)
Trying to access https://sync-4-us-east-1.stage.mozaws.net gives me a certificate warning:
Websites prove their identity via certificates, which are valid for a set time period. The certificate for sync-4-us-east-1.stage.mozaws.net expired on 4/16/2020.
Error code: SEC_ERROR_EXPIRED_CERTIFICATE
This is causing problems with some sync integration tests that run on stage, such as https://github.com/mozilla-mobile/firefox-ios/issues/6554
Comment 1 (Assignee) • 5 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #0)
> Websites prove their identity via certificates, which are valid for a set time period. The certificate for sync-4-us-east-1.stage.mozaws.net expired on 4/16/2020.
We did not update this server because it's not presently functioning, and is marked as down in the staging token server.
> This is causing problems with some sync integration tests that run on stage, such as https://github.com/mozilla-mobile/firefox-ios/issues/6554
The only node that is marked as up in Tokenserver stage is the Durable Sync node. Testing should no longer be pointed at the staging py-sync nodes. How can we get everything pointed at Durable Sync?
Comment 2 (Assignee) • 5 years ago
I've fixed Sync node 4. But I still think it would be best to no longer point at nodes directly.
Comment 3 (Reporter) • 4 years ago
I don't think the test is deliberately pointing at this specific node. I wonder if the user is somehow still assigned to that node in the stage tokenserver config, despite it being marked as down.
:isabel_rios, does the failing test here always use the same Firefox Account, or does it create a fresh one each time?
Comment 4 • 4 years ago
For each test we create a new firefox stage account using fxa cli tool (https://pypi.org/project/fxacli/). That account is also verified while configuring the tests. After the test, the account is removed.
Thanks for your help with this issue!
Comment 5 (Assignee) • 4 years ago
(In reply to Isabel Rios [:isabel_rios] from comment #4)
> For each test we create a new firefox stage account using fxa cli tool (https://pypi.org/project/fxacli/). That account is also verified while configuring the tests. After the test, the account is removed.
In that case, this shouldn't have been broken, because 100% of new staging accounts have been routed to Spanner since mid-March. So one of these assumptions is broken:
- A new account is created for every test listed above.
- 100% of new staging accounts are routed to Spanner.
I can check on the second of those.
Comment 6 (Assignee) • 4 years ago
(In reply to Bob Micheletto [:bobm] from comment #5)
> I can check on the second of those.
spanner_node_id = 73
migrate_new_user_percentage = 10
The 10% routing is only part of the problem. Sync node 4 is marked both down and backoff. So, it shouldn't be receiving new users in any case. We should probably file a Tokenserver bug to investigate that, since it could be problematic in production.
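To make the expected behaviour concrete, here is a minimal sketch of new-user routing under the settings above. It is not Tokenserver's actual code: the Node class, the pick_node_for_new_user function, and the exact meaning given to the downed and backoff flags are assumptions for illustration. With migrate_new_user_percentage = 10, roughly 90% of new users fall through to the legacy nodes, but a node marked down or backoff should still never be chosen:

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    # Hypothetical stand-in for a row in the Tokenserver nodes table.
    node_id: int
    downed: bool = False
    backoff: bool = False


def pick_node_for_new_user(
    nodes: Sequence[Node],
    spanner_node_id: int,
    migrate_new_user_percentage: int,
    rng: Optional[random.Random] = None,
) -> Optional[int]:
    """Illustrative selection logic only; an assumption, not the real implementation."""
    if rng is None:
        rng = random.Random()
    # Send the configured percentage of new users straight to the Spanner node.
    if rng.random() * 100 < migrate_new_user_percentage:
        return spanner_node_id
    # Everyone else goes to a legacy node, skipping any marked down or in backoff.
    candidates = [n for n in nodes if not n.downed and not n.backoff]
    if not candidates:
        return None
    return rng.choice(candidates).node_id
```

Under this model, a new account landing on Sync node 4 would point to the down/backoff filter not being applied, rather than to the percentage setting itself.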
Comment 7 (Assignee) • 4 years ago
The new user percentage was moved to a table in the database, and has been set to 100%:
MySQL [tokenserver]> SELECT * FROM dynamic_settings;
+-----------------------------+-------+--------------------------------------------+
| setting                     | value | description                                |
+-----------------------------+-------+--------------------------------------------+
| migrate_new_user_percentage | 100   | percent of new users to migrate to spanner |
+-----------------------------+-------+--------------------------------------------+
Comment 8 (Reporter) • 4 years ago
> Sync node 4 is marked both down and backoff. So, it shouldn't be receiving new users in any case.
> We should probably file a Tokenserver bug to investigate that, since it could be problematic in production.
Interesting. Tokenserver appears to be trying to skip downed nodes when selecting a new node:
But perhaps this isn't working correctly.
Comment 9 (Reporter) • 4 years ago
I tried creating a few new accounts in stage, and AFAICT they are correctly assigned to https://stage.sync.nonprod.cloudops.mozgcp.net/. So, I'm not entirely sure what's going on here.
One thing to note is that I don't think tokenserver has any clever handling of existing users who are assigned to downed nodes. If you mark a node as downed, tokenserver will keep telling its existing inhabitants to go there for their sync data until they are moved off that node by some other mechanism (such as via the unassign_node.py helper script).
If the intention of downed=True is that the node is dead and is never coming back, it may be worth adding a bit of logic to re-assign users on demand if it discovers they're on a downed node. (Or maybe not, if we think we'll stop having this problem once durable sync is fully rolled out.)
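The on-demand re-assignment idea could look roughly like the sketch below. Everything in it is hypothetical: the resolve_node function, the in-memory assignments and nodes mappings, and the allocate callback are illustrative stand-ins, not part of Tokenserver or of unassign_node.py.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Node:
    # Hypothetical stand-in for a storage node record.
    node_id: int
    downed: bool = False


def resolve_node(
    user_id: str,
    assignments: Dict[str, int],
    nodes: Dict[int, Node],
    allocate: Callable[[], Optional[int]],
) -> Optional[int]:
    """Return a usable node for the user, re-assigning them if theirs is downed."""
    current = assignments.get(user_id)
    node = nodes.get(current) if current is not None else None
    if node is not None and not node.downed:
        # The existing assignment is healthy, so keep it.
        return current
    # The user is unassigned or sits on a downed/unknown node: pick a new one.
    new_node = allocate()
    if new_node is not None:
        assignments[user_id] = new_node
    return new_node
```

Here allocate stands in for whatever logic normally places a new user, so a user found on a downed node is treated much like a fresh account.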
Comment 10 (Assignee) • 4 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #9)
> If the intention of downed=True is that the node is dead and is never coming back, it may be worth adding a bit of logic to re-assign users on demand if it discovers they're on a downed node. (Or maybe not, if we think we'll stop having this problem once durable sync is fully rolled out)
I think we're okay as is for now. I'm going to close this bug out since there doesn't seem to be anything else to do for the moment. Let's re-open if that changes.