Closed Bug 1637171 Opened 5 years ago Closed 4 years ago

Expired TLS certificates on sync storage nodes in stage

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rfkelly, Assigned: bobm)

Trying to access https://sync-4-us-east-1.stage.mozaws.net gives me a certificate warning:

Websites prove their identity via certificates, which are valid for a set time period. The certificate for sync-4-us-east-1.stage.mozaws.net expired on 4/16/2020.
 
Error code: SEC_ERROR_EXPIRED_CERTIFICATE

This is causing problems with some sync integration tests that run on stage, such as https://github.com/mozilla-mobile/firefox-ios/issues/6554
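For anyone wanting to confirm this kind of failure from a script rather than a browser, here is a minimal sketch of the expiry comparison. It assumes you already have the certificate's `notAfter` string (the format Python's `ssl.SSLSocket.getpeercert()` reports). The time of day in `NOT_AFTER` is made up, since the warning only gives the date.

```python
import ssl
from datetime import datetime, timezone


def cert_is_expired(not_after, now):
    """Return True if `now` is later than the certificate's notAfter time.

    `not_after` is a string in the format used by the `notAfter` field of
    ssl.SSLSocket.getpeercert(), e.g. "Apr 16 12:00:00 2020 GMT".
    `now` is a timezone-aware datetime.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return now.timestamp() > expires_at


# Assumption: only the date (4/16/2020) is known; the time of day is made up.
NOT_AFTER = "Apr 16 12:00:00 2020 GMT"

before = datetime(2020, 4, 1, tzinfo=timezone.utc)
after = datetime(2020, 5, 12, tzinfo=timezone.utc)
print(cert_is_expired(NOT_AFTER, before))  # False
print(cert_is_expired(NOT_AFTER, after))   # True
```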

Assignee: nobody → bobm

(In reply to Ryan Kelly [:rfkelly] from comment #0)

> Websites prove their identity via certificates, which are valid for a set time period. The certificate for sync-4-us-east-1.stage.mozaws.net expired on 4/16/2020.

We did not update this server because it's not presently functioning, and is marked as down in the staging token server.

> This is causing problems with some sync integration tests that run on stage, such as https://github.com/mozilla-mobile/firefox-ios/issues/6554

The only node that is marked as up in Tokenserver stage is the Durable Sync node. Testing should no longer be pointed at the staging py-sync nodes. How can we get everything pointed at Durable Sync?

Flags: needinfo?(rfkelly)

I've fixed Sync node 4, but I still think it would be best to stop pointing tests at nodes directly.

I don't think the test is deliberately pointing at this specific node. I wonder whether the user is somehow still assigned to that node in the stage tokenserver config, despite it being marked as down.

:isabel_rios, does the failing test here always use the same Firefox Account, or does it create a fresh one each time?

Flags: needinfo?(rfkelly) → needinfo?(irios.mozilla)

For each test we create a new Firefox stage account using the fxacli tool (https://pypi.org/project/fxacli/). That account is also verified while configuring the tests. After the test, the account is removed.

Thanks for your help with this issue!

Flags: needinfo?(irios.mozilla)

(In reply to Isabel Rios[:isabel_rios] from comment #4)

> For each test we create a new Firefox stage account using the fxacli tool (https://pypi.org/project/fxacli/). That account is also verified while configuring the tests. After the test, the account is removed.

In that case, this shouldn't have been broken, because 100% of new staging accounts have been routed to Spanner since mid-March. So one of these assumptions must be wrong:

  • A new account is created for every test listed above.
  • 100% of new staging accounts are routed to Spanner.

I can check on the second of those.

(In reply to Bob Micheletto [:bobm] from comment #5)

> I can check on the second of those.

spanner_node_id = 73
migrate_new_user_percentage = 10

The 10% routing is only part of the problem. Sync node 4 is marked both down and backoff. So, it shouldn't be receiving new users in any case. We should probably file a Tokenserver bug to investigate that, since it could be problematic in production.
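To illustrate why a 10% setting matters, here is a rough sketch of percentage-based routing for new users. This is not the tokenserver implementation; `pick_node_for_new_user` and `LEGACY_NODE_ID` are hypothetical names, and only `spanner_node_id = 73` and the percentage come from the config above.

```python
import random

SPANNER_NODE_ID = 73   # from the stage config quoted above
LEGACY_NODE_ID = 4     # hypothetical id standing in for a py-sync node


def pick_node_for_new_user(migrate_new_user_percentage, rng=random):
    """Send roughly N% of new users to Spanner and the rest to a legacy node."""
    # rng.random() is in [0.0, 1.0), so the roll is always below 100.
    roll = rng.random() * 100
    if roll < migrate_new_user_percentage:
        return SPANNER_NODE_ID
    return LEGACY_NODE_ID


rng = random.Random(0)  # seeded so the example is repeatable
sample = [pick_node_for_new_user(10, rng) for _ in range(1000)]
print(sample.count(SPANNER_NODE_ID), "of 1000 new users routed to Spanner at 10%")
```

With the percentage at 10, most new accounts in this model would still land on a legacy node. At 100, every new account goes to Spanner.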

Status: NEW → ASSIGNED

The new user percentage was moved to a table in the database, and has been set to 100%:

MySQL [tokenserver]> SELECT * FROM dynamic_settings;
+-----------------------------+-------+--------------------------------------------+
| setting                     | value | description                                |
+-----------------------------+-------+--------------------------------------------+
| migrate_new_user_percentage | 100   | percent of new users to migrate to spanner |
+-----------------------------+-------+--------------------------------------------+

> Sync node 4 is marked both down and backoff. So, it shouldn't be receiving new users in any case.
> We should probably file a Tokenserver bug to investigate that, since it could be problematic in production.

Interesting. Tokenserver appears to be trying to skip downed nodes when selecting a new node:

https://github.com/mozilla-services/tokenserver/blob/f19ac0e8402b8203c1db51e9321ea50cb361f634/tokenserver/assignment/sqlnode/sql.py#L687

But perhaps this isn't working correctly.
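For reference, the intended behaviour can be sketched as below. This is a simplified model, not the linked `sql.py` code: the `Node` class, its field types, and the "most free capacity wins" rule are all assumptions made for illustration, as are the capacity numbers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    url: str
    available: int          # free capacity (hypothetical field and values)
    downed: bool = False
    backoff: bool = False


def pick_new_node(nodes: List[Node]) -> Optional[Node]:
    """Pick a node for a new user, skipping downed or backed-off nodes."""
    candidates = [
        n for n in nodes
        if not n.downed and not n.backoff and n.available > 0
    ]
    if not candidates:
        return None
    # Assumption: prefer the node with the most free capacity.
    return max(candidates, key=lambda n: n.available)


nodes = [
    Node("https://sync-4-us-east-1.stage.mozaws.net", available=500,
         downed=True, backoff=True),
    Node("https://stage.sync.nonprod.cloudops.mozgcp.net/", available=100),
]
chosen = pick_new_node(nodes)
print(chosen.url if chosen else None)
```

In this model the downed node is never returned, even though it advertises more free capacity.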

I tried creating a few new accounts in stage, and AFAICT they are correctly assigned to https://stage.sync.nonprod.cloudops.mozgcp.net/. So, I'm not entirely sure what's going on here.

One thing to note is, I don't think tokenserver has any clever handling of existing users who are assigned to downed nodes. If you mark a node as downed, tokenserver will keep telling its existing inhabitants to go there for their sync data, until they are moved off that node by some other mechanism (such as via the unassign_node.py helper script).

If the intention of downed=True is that the node is dead and is never coming back, it may be worth adding a bit of logic so that tokenserver re-assigns users on demand when it discovers they're on a downed node. (Or maybe not, if we think we'll stop having this problem once durable sync is fully rolled out.)
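A rough sketch of what that on-demand re-assignment might look like, assuming in-memory dicts in place of the real database tables. `resolve_node`, `assignments`, and `downed` are hypothetical names, and the user id is made up.

```python
def resolve_node(user, assignments, downed, pick_new_node):
    """Return the user's node, re-assigning if the current node is downed.

    assignments: dict mapping user id -> node URL (stand-in for the users table)
    downed: dict mapping node URL -> True if the node is marked as down
    pick_new_node: callable returning the URL of a healthy node
    """
    current = assignments.get(user)
    if current is None or downed.get(current, False):
        current = pick_new_node()
        assignments[user] = current
    return current


OLD = "https://sync-4-us-east-1.stage.mozaws.net"
NEW = "https://stage.sync.nonprod.cloudops.mozgcp.net/"

assignments = {"user-a": OLD}   # hypothetical existing user on the downed node
downed = {OLD: True, NEW: False}

print(resolve_node("user-a", assignments, downed, lambda: NEW))
```

Today's behaviour, as described above, corresponds to skipping the `downed` check and always returning the stored assignment.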

(In reply to Ryan Kelly [:rfkelly] from comment #9)

> If the intention of downed=True is that the node is dead and is never coming back, it may be worth adding a bit of logic so that tokenserver re-assigns users on demand when it discovers they're on a downed node. (Or maybe not, if we think we'll stop having this problem once durable sync is fully rolled out.)

I think we're okay as is for now. I'm going to close this bug out since there doesn't seem to be anything else to do for the moment. Let's re-open if that changes.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
See Also: → 1648581
Product: Cloud Services → Cloud Services Graveyard