Closed Bug 1542798 Opened 5 years ago Closed 5 years ago

Auth0 login doesn't work when pulse is down

Categories

(Taskcluster :: Services, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: hassan, Unassigned)

When pulse went down, according to the post-mortem, users were not able to log in via Auth0 on the tools site.

Possible reason: When logging in via Auth0 in tc-tools or tc-ui, part of the process requires fetching the user's Taskcluster credentials. Those credentials are fetched with taskcluster-client, which stops working when pulse is down, hence the inability to log in.

Possible solution: Allow users to authenticate via Auth0 and return an error when the retrieval of Taskcluster credentials fails.
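A minimal sketch of the possible solution above, in Python for illustration only. It assumes the Auth0 step has already succeeded and treats the credential fetch as an injected callable. `login`, `LoginResult`, `CredentialsUnavailable`, and `fetch_credentials` are hypothetical names, not part of tc-tools, tc-ui, or taskcluster-client.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class CredentialsUnavailable(Exception):
    """Hypothetical error raised when Taskcluster credentials cannot be fetched."""


@dataclass
class LoginResult:
    user_id: str                  # identity already established by Auth0
    credentials: Optional[dict]   # Taskcluster credentials, or None if the fetch failed
    error: Optional[str] = None   # message the UI can show instead of failing silently


def login(user_id: str, access_token: str,
          fetch_credentials: Callable[[str], dict]) -> LoginResult:
    """Keep the Auth0 identity even when the credential fetch fails."""
    try:
        creds = fetch_credentials(access_token)
    except CredentialsUnavailable as exc:
        # Authentication succeeded; only the Taskcluster step failed.
        return LoginResult(
            user_id, None,
            f"Authenticated, but Taskcluster credentials are unavailable: {exc}",
        )
    return LoginResult(user_id, creds)
```

With this shape the UI can tell "not logged in" apart from "logged in, but Taskcluster is unreachable" and report the second case explicitly.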

Dustin, did you try to log in when pulse was down? I'm curious to know what a user sees in the UI when they attempt to log in while pulse is down. Does the UI show the user as logged in or does it hide all content and show an error instead?

Flags: needinfo?(dustin)

Pulse being down brings most of Taskcluster down, since everything that sends a pulse message ends up waiting for pulse and then retrying. So the issue was that auth was down -- login was issuing credentials just fine, but auth wasn't accepting them.

Pulse is as much a core component of Taskcluster as Heroku or EC2 -- if Pulse is down, Taskcluster is down. I don't think there's much we can do to optimize that situation, particularly since it will manifest a little differently every time.

If I recall, the key problem was that users couldn't log in to treestatus, which is part of mozilla-releng.net. That system is using the federated-login approach we have in place where it trades an Auth0 access_token for Taskcluster credentials, and then uses those TC credentials for access control (deciding who can and cannot change tree status). It was that last call (in particular, calling auth.currentScopes) that failed. Especially for a system that should be usable even when other things (like TC) are failing, a dependency on TC might not be a good design choice.

Flags: needinfo?(dustin)

> Pulse being down brings most of Taskcluster down, since everything that sends a pulse message ends up waiting for pulse and then retrying. So the issue was that auth was down -- login was issuing credentials just fine, but auth wasn't accepting them.

What I understand from this is that authenticating via Auth0 in tools works; however, authorization was not successful because auth.currentScopes doesn't work when pulse is down. This would lead users to see themselves as logged in, but they would have an empty set of scopes. Please correct me if my understanding is wrong.

> Pulse is as much a core component of Taskcluster as Heroku or EC2 -- if Pulse is down, Taskcluster is down. I don't think there's much we can do to optimize that situation, particularly since it will manifest a little differently every time.

Agreed.

> If I recall, the key problem was that users couldn't log in to treestatus, which is part of mozilla-releng.net. That system is using the federated-login approach we have in place where it trades an Auth0 access_token for Taskcluster credentials, and then uses those TC credentials for access control (deciding who can and cannot change tree status).

Agreed. It makes sense for treestatus to remove its dependency on Taskcluster.

Pretty much!

> This would lead users to see themselves as logged in but they would have an empty set of scopes.

Hopefully not -- the auth.currentScopes call should fail, rather than returning an empty list. But I don't know the treestatus code, and it might have a bug that gets this wrong.
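To illustrate the distinction above, here is a rough Python sketch of an access check that fails loudly when the scope lookup fails, instead of treating the failure as an empty scope list. This is not the treestatus code. `can_change_tree_status`, `ScopeLookupError`, and the injected `lookup_scopes` callable are hypothetical, and the exact-match comparison is a simplification of real scope matching.

```python
from typing import Callable, List


class ScopeLookupError(Exception):
    """Hypothetical error: the caller's scopes could not be determined."""


def can_change_tree_status(lookup_scopes: Callable[[], List[str]],
                           required_scope: str) -> bool:
    """Return True only if the lookup succeeded and includes required_scope."""
    try:
        scopes = lookup_scopes()
    except Exception as exc:
        # A failed lookup is an outage, not "this user has no scopes".
        raise ScopeLookupError("could not determine scopes") from exc
    # Simplification: exact string match only.
    return required_scope in scopes
```

A caller can then show "Taskcluster is unavailable" on `ScopeLookupError` and "permission denied" on a `False` result, so an outage is never displayed as a logged-in user with no scopes.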

> It makes sense for treestatus to remove its dependency on Taskcluster.

I think it's a lot easier to get group membership from the People API now, so that's probably a good direction to move in.

Blocks: 1542805

The new federated sign-in process (bug 1561905) will address this as much as possible.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME