Auth0 login doesn't work when pulse is down
Categories: Taskcluster :: Services, defect
Tracking: Not tracked
People: Reporter: hassan, Unassigned
When pulse went down, according to the post-mortem, users were not able to log in via Auth0 on the tools site.
Possible reason: When logging in via Auth0 in tc-tools or tc-ui, part of the process requires fetching the user's Taskcluster credentials. taskcluster-client is used to get those credentials, and it stops working when pulse is down, so the login fails.
Possible solution: Allow users to authenticate via Auth0, and return an error when the retrieval of Taskcluster credentials fails.
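A minimal sketch of that idea, not taken from tc-tools or tc-ui: the Auth0 identity is kept even when the credential fetch fails, and the failure is reported alongside it. `LoginResult`, `login`, and the `fetch_credentials` callable are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoginResult:
    user_id: str                        # identity already established by Auth0
    credentials: Optional[dict] = None  # Taskcluster credentials, if retrieved
    error: Optional[str] = None         # set when credential retrieval failed


def login(user_id: str, fetch_credentials: Callable[[str], dict]) -> LoginResult:
    """Keep the Auth0 identity even if the credential fetch fails.

    `fetch_credentials` is a hypothetical stand-in for whatever call
    retrieves Taskcluster credentials for the authenticated user.
    """
    try:
        return LoginResult(user_id=user_id, credentials=fetch_credentials(user_id))
    except Exception as exc:  # e.g. a timeout while Taskcluster is unavailable
        return LoginResult(
            user_id=user_id,
            error=f"could not fetch Taskcluster credentials: {exc}",
        )
```

With this shape the UI can show the user as logged in while surfacing the error, instead of failing the whole login.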
Comment 1 (Reporter) • 5 years ago
Dustin, did you try to log in when pulse was down? I'm curious to know what a user sees in the UI when they attempt to log in while pulse is down. Does the UI show the user as logged in, or does it hide all content and show an error instead?
Comment 2 • 5 years ago
Pulse being down brings most of Taskcluster down, since everything that sends a pulse message ends up waiting for pulse and then retrying. So the issue was that auth was down -- login was issuing credentials just fine, but auth wasn't accepting them.
Pulse is as much a core component of Taskcluster as Heroku or EC2 -- if Pulse is down, Taskcluster is down. I don't think there's much we can do to optimize that situation, particularly since it will manifest a little differently every time.
If I recall, the key problem was that users couldn't log in to treestatus, which is part of mozilla-releng.net. That system is using the federated-login approach we have in place where it trades an Auth0 access_token for Taskcluster credentials, and then uses those TC credentials for access control (deciding who can and cannot change tree status). It was that last call (in particular, calling auth.currentScopes) that failed. Especially for a system that should be usable even when other things (like TC) are failing, a dependency on TC might not be a good design choice.
Comment 3 (Reporter) • 5 years ago
> Pulse being down brings most of Taskcluster down, since everything that sends a pulse message ends up waiting for pulse and then retrying. So the issue was that auth was down -- login was issuing credentials just fine, but auth wasn't accepting them.
What I understand from this is that authenticating via Auth0 in tools works, but authorization was not successful because auth.currentScopes doesn't work when pulse is down. This would lead users to see themselves as logged in but they would have an empty set of scopes. Please correct me if my understanding is wrong.
> Pulse is as much a core component of Taskcluster as Heroku or EC2 -- if Pulse is down, Taskcluster is down. I don't think there's much we can do to optimize that situation, particularly since it will manifest a little differently every time.
Agreed.
> If I recall, the key problem was that users couldn't log in to treestatus, which is part of mozilla-releng.net. That system is using the federated-login approach we have in place where it trades an Auth0 access_token for Taskcluster credentials, and then uses those TC credentials for access control (deciding who can and cannot change tree status).
Agreed. It makes sense for treestatus to remove its dependency on Taskcluster.
Comment 4 • 5 years ago
Pretty much!
> This would lead users to see themselves as logged in but they would have an empty set of scopes.
Hopefully not -- the auth.currentScopes call should fail, rather than returning an empty list. But I don't know the treestatus code, and it might have a bug that gets this wrong.
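To illustrate the distinction, here is a hedged sketch of what a consumer could do. It is not treestatus code: `ScopesUnavailable`, `can_change_tree_status`, the `get_scopes` callable (standing in for a call such as auth.currentScopes), and the scope string are all hypothetical. The membership test is a simplified exact match, not real Taskcluster scope satisfaction.

```python
from typing import Callable, List


class ScopesUnavailable(Exception):
    """Raised when the scope lookup itself failed (hypothetical name)."""


def can_change_tree_status(get_scopes: Callable[[], List[str]],
                           required_scope: str) -> bool:
    """Return whether the user holds `required_scope`.

    A failed lookup is re-raised as ScopesUnavailable so it is never
    mistaken for "the user has no scopes".
    """
    try:
        scopes = get_scopes()
    except Exception as exc:
        raise ScopesUnavailable("scope lookup failed") from exc
    # Simplified exact-match check, for illustration only.
    return required_scope in scopes
```

A caller can then show a "service unavailable" error for `ScopesUnavailable` and a "permission denied" error for a `False` result.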
> It makes sense for treestatus to remove its dependency on Taskcluster.
I think it's a lot easier to get group membership from the People API now, so that's probably a good direction to move in.
Comment 5 • 5 years ago
The new federated sign-in process (bug 1561905) will address this as much as possible.