Closed Bug 1406322 Opened 8 years ago Closed 7 years ago

Taskcluster auth0 integration seems to be broken

Categories

(Taskcluster :: Services, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Unassigned)

References

Details

Attachments

(1 file)

No description provided.
See https://sentry.prod.mozaws.net/operations/taskcluster-login/issues/656340/ It looks like this crash is triggered by an event, so I'm not sure which line(s) of the code this originates from, but presumably something inside https://github.com/taskcluster/taskcluster-login/blob/master/src/authn/auth0.js may be the cause.
We are also having failures such as this one (screenshot attached) when visiting https://tools.taskcluster.net/credentials/ This also happens after logging out, and logging back in again.
Flags: needinfo?(dustin)
Sample from https://papertrailapp.com/systems/taskcluster-login/events Oct 06 11:24:29 taskcluster-login app/web.1: 2017-10-06T09:24:29.055Z MozilliansAuthorizer found mozillians username grenade; vouched: true Oct 06 11:24:29 taskcluster-login app/web.1: 2017-10-06T09:24:29.064Z MozilliansAuthorizer found mozillians username aselagea; vouched: true Oct 06 11:24:29 taskcluster-login app/web.1: 2017-10-06T09:24:29.122Z MozilliansAuthorizer found mozillians username grenade; vouched: true Oct 06 11:24:30 taskcluster-login app/web.1: InsufficientAccessRightsError: InsufficientAccessRightsError Oct 06 11:24:30 taskcluster-login app/web.1: at messageCallback (/app/node_modules/ldapjs/lib/client/client.js:1419:45) Oct 06 11:24:30 taskcluster-login app/web.1: at Parser.onMessage (/app/node_modules/ldapjs/lib/client/client.js:1089:14) Oct 06 11:24:30 taskcluster-login app/web.1: at emitOne (events.js:115:13) Oct 06 11:24:30 taskcluster-login app/web.1: at Parser.emit (events.js:210:7) Oct 06 11:24:30 taskcluster-login app/web.1: at Parser.write (/app/node_modules/ldapjs/lib/messages/parser.js:111:8) Oct 06 11:24:30 taskcluster-login app/web.1: at TLSSocket.onData (/app/node_modules/ldapjs/lib/client/client.js:1076:22) Oct 06 11:24:30 taskcluster-login heroku/router: at=info method=GET path="/auth0/callback?code=Y3Tb7XB3eM_OT_HR" host=login.taskcluster.net request_id=afa849cb-9f44-4187-9aa9-6cd2992567c7 fwd="118.163.10.190" dyno=web.1 connect=1ms service=7266ms status=500 bytes=276 protocol=https Oct 06 11:24:30 taskcluster-login app/web.1: at emitOne (events.js:115:13) Oct 06 11:24:30 taskcluster-login app/web.1: at TLSSocket.emit (events.js:210:7) Oct 06 11:24:30 taskcluster-login app/web.1: at addChunk (_stream_readable.js:252:12) Oct 06 11:24:30 taskcluster-login app/web.1: at readableAddChunk (_stream_readable.js:239:11) Oct 06 11:24:30 taskcluster-login app/web.1: at TLSSocket.Readable.push (_stream_readable.js:197:10) Oct 06 11:24:30 taskcluster-login app/web.1: at TLSWrap.onread (net.js:589:20) Oct 06 11:24:31 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=8e7883cb-516c-413e-8413-65592a0528dd fwd="5.2.198.53" dyno=web.1 connect=0ms service=6218ms status=500 bytes=904 protocol=https Oct 06 11:24:31 taskcluster-login app/web.1: 2017-10-06T09:24:31.673Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|aselagea Oct 06 11:24:32 taskcluster-login app/web.1: 2017-10-06T09:24:32.096Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/aselagea@mozilla.com Oct 06 11:24:32 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=499f7a75-3b29-413e-bb6f-48fd987c9eac fwd="5.2.198.53" dyno=web.1 connect=1ms service=6928ms status=500 bytes=904 protocol=https Oct 06 11:24:32 taskcluster-login app/web.1: 2017-10-06T09:24:32.480Z MozilliansAuthorizer found mozillians username aselagea; vouched: true Oct 06 11:24:32 taskcluster-login app/web.1: 2017-10-06T09:24:32.698Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|aselagea Oct 06 11:24:33 taskcluster-login app/web.1: 2017-10-06T09:24:33.116Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/aselagea@mozilla.com Oct 06 11:24:33 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=acd17948-13ea-4173-a632-7c1d58f46814 fwd="5.2.198.53" dyno=web.1 connect=1ms service=7103ms status=500 bytes=904 protocol=https Oct 06 11:24:33 taskcluster-login app/web.1: 2017-10-06T09:24:33.501Z MozilliansAuthorizer found mozillians username aselagea; vouched: true Oct 06 11:24:33 taskcluster-login app/web.1: 2017-10-06T09:24:33.717Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|aselagea Oct 06 11:24:33 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=87fe110a-4980-44c2-90d4-a9b8ae126a66 fwd="5.2.198.53" dyno=web.1 connect=7ms service=7080ms status=500 bytes=904 protocol=https Oct 06 11:24:34 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=3eb276b8-f578-4ec7-ba5d-8449d6ef4956 fwd="5.2.198.53" dyno=web.1 connect=4ms service=6916ms status=500 bytes=903 protocol=https Oct 06 11:24:34 taskcluster-login app/web.1: 2017-10-06T09:24:33.910Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|aselagea Oct 06 11:24:34 taskcluster-login app/web.1: 2017-10-06T09:24:34.045Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|aselagea Oct 06 11:24:34 taskcluster-login app/web.1: 2017-10-06T09:24:34.122Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/aselagea@mozilla.com Oct 06 11:24:34 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=d2ce59db-d195-4c79-9350-85221ea9f778 fwd="185.189.197.1" dyno=web.1 connect=0ms service=6987ms status=500 bytes=904 protocol=https Oct 06 11:24:34 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=e07d8871-261d-4df2-9371-64a76c7b0fc6 fwd="5.2.198.53" dyno=web.1 connect=0ms service=7079ms status=500 bytes=904 protocol=https Oct 06 11:24:34 taskcluster-login app/web.1: 2017-10-06T09:24:34.338Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/aselagea@mozilla.com Oct 06 11:24:34 taskcluster-login app/web.1: 2017-10-06T09:24:34.402Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|aselagea Oct 06 11:24:34 taskcluster-login app/web.1: 2017-10-06T09:24:34.458Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/aselagea@mozilla.com Oct 06 11:24:34 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=4416f490-8c3e-42cb-a543-f57d015a50d6 fwd="185.189.197.1" dyno=web.1 connect=1ms service=7493ms status=500 bytes=904 protocol=https Oct 06 11:24:34 taskcluster-login app/web.1: 2017-10-06T09:24:34.642Z MozilliansAuthorizer found mozillians username aselagea; vouched: true Oct 06 11:24:34 taskcluster-login app/web.1: 2017-10-06T09:24:34.821Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/aselagea@mozilla.com Oct 06 11:24:35 taskcluster-login app/web.1: 2017-10-06T09:24:35.288Z MozilliansAuthorizer found mozillians username aselagea; vouched: true Oct 06 11:24:35 taskcluster-login app/web.1: 2017-10-06T09:24:35.387Z MozilliansAuthorizer found mozillians username aselagea; vouched: true Oct 06 11:24:35 taskcluster-login app/web.1: 2017-10-06T09:24:35.577Z MozilliansAuthorizer found mozillians username aselagea; vouched: true Oct 06 11:24:37 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=34b3492e-c1be-4faf-b01a-1ce40102f5c6 fwd="5.2.198.53" dyno=web.1 connect=1ms service=6052ms status=500 bytes=904 protocol=https Oct 06 11:24:37 taskcluster-login heroku/router: at=info method=GET path="/auth0/callback?code=Y3Tb7XB3eM_OT_HR" host=login.taskcluster.net request_id=0de78d7d-9fdb-44fc-b8aa-d5b3240d716d fwd="118.163.10.190" dyno=web.1 connect=0ms service=449ms status=500 bytes=276 protocol=https Oct 06 11:24:38 taskcluster-login app/web.1: Error Oct 06 11:24:38 taskcluster-login app/web.1: at Strategy.OAuth2Strategy.parseErrorResponse (/app/node_modules/passport-oauth2/lib/strategy.js:329:12) Oct 06 11:24:38 taskcluster-login app/web.1: at Strategy.OAuth2Strategy._createOAuthError (/app/node_modules/passport-oauth2/lib/strategy.js:376:16) Oct 06 11:24:38 taskcluster-login app/web.1: at /app/node_modules/passport-oauth2/lib/strategy.js:166:45 Oct 06 11:24:38 taskcluster-login app/web.1: at /app/node_modules/oauth/lib/oauth2.js:191:18 Oct 06 11:24:38 taskcluster-login app/web.1: at passBackControl (/app/node_modules/oauth/lib/oauth2.js:132:9) Oct 06 11:24:38 taskcluster-login app/web.1: at IncomingMessage.<anonymous> (/app/node_modules/oauth/lib/oauth2.js:157:7) Oct 06 11:24:38 taskcluster-login app/web.1: at emitNone (events.js:110:20) Oct 06 11:24:38 taskcluster-login app/web.1: at IncomingMessage.emit (events.js:207:7) Oct 06 11:24:38 taskcluster-login app/web.1: at endReadableNT (_stream_readable.js:1045:12) Oct 06 11:24:38 taskcluster-login app/web.1: at _combinedTickCallback (internal/process/next_tick.js:138:11) Oct 06 11:24:38 taskcluster-login app/web.1: at process._tickDomainCallback (internal/process/next_tick.js:218:9) Oct 06 11:24:38 taskcluster-login app/web.1: 2017-10-06T09:24:37.827Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|aselagea Oct 06 11:24:38 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=ba9c74a1-542e-46e9-8470-25967f14b7bd fwd="5.2.198.53" dyno=web.1 connect=1ms service=5679ms status=500 bytes=904 protocol=https Oct 06 11:24:38 taskcluster-login app/web.1: 2017-10-06T09:24:38.237Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/aselagea@mozilla.com Oct 06 11:24:38 taskcluster-login app/web.1: 2017-10-06T09:24:38.492Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|aselagea Oct 06 11:24:38 taskcluster-login app/web.1: 2017-10-06T09:24:38.633Z MozilliansAuthorizer found mozillians username aselagea; vouched: true Oct 06 11:24:39 taskcluster-login heroku/router: at=info method=OPTIONS path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=55dc1f20-51e5-4e83-9db2-f814e513137d fwd="93.203.6.253" dyno=web.1 connect=2ms service=2ms status=200 bytes=449 protocol=https Oct 06 11:24:39 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=af858b56-6d45-4f2e-8d9f-600e9aa2d5d2 fwd="5.2.198.53" dyno=web.1 connect=0ms service=5167ms status=500 bytes=903 protocol=https Oct 06 11:24:39 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=7d3d73e5-1974-4822-bbd2-7354931dc051 fwd="5.2.198.53" dyno=web.1 connect=1ms service=5156ms status=500 bytes=904 protocol=https Oct 06 11:24:39 taskcluster-login app/web.1: 2017-10-06T09:24:38.922Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/aselagea@mozilla.com Oct 06 11:24:39 taskcluster-login app/web.1: 2017-10-06T09:24:39.001Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|aselagea Oct 06 11:24:39 taskcluster-login app/web.1: 2017-10-06T09:24:39.121Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|pmoore Oct 06 11:24:39 taskcluster-login app/web.1: 2017-10-06T09:24:39.141Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|pmoore Oct 06 11:24:39 taskcluster-login heroku/router: at=info method=OPTIONS path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=82d417ef-61e4-4d23-aa39-98bcb031ef1a fwd="93.203.6.253" dyno=web.1 connect=1ms service=2ms status=200 bytes=449 protocol=https Oct 06 11:24:39 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=140fc6ce-0ace-4627-b113-3c7a0747bc0e fwd="5.2.198.53" dyno=web.1 connect=1ms service=5330ms status=500 bytes=904 protocol=https Oct 06 11:24:39 taskcluster-login app/web.1: 2017-10-06T09:24:39.216Z handlers.mozilla-auth0 received valid access_token for subject ad|Mozilla-LDAP|aselagea Oct 06 11:24:39 taskcluster-login app/web.1: 2017-10-06T09:24:39.301Z MozilliansAuthorizer found mozillians username aselagea; vouched: true Oct 06 11:24:39 taskcluster-login heroku/router: at=info method=GET path="/v1/oidc-credentials/mozilla-auth0" host=login.taskcluster.net request_id=9984e9de-b5fe-485b-a27e-848fd5f653f9 fwd="5.2.198.53" dyno=web.1 connect=0ms service=5161ms status=500 bytes=904 protocol=https Oct 06 11:24:39 taskcluster-login app/web.1: 2017-10-06T09:24:39.495Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/aselagea@mozilla.com Oct 06 11:24:39 taskcluster-login app/web.1: 2017-10-06T09:24:39.557Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/pmoore@mozilla.com Oct 06 11:24:39 taskcluster-login app/web.1: 2017-10-06T09:24:39.581Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/pmoore@mozilla.com Oct 06 11:24:39 taskcluster-login app/web.1: 2017-10-06T09:24:39.632Z MozilliansAuthorizer mozillians authorizing mozilla-ldap/aselagea@mozilla.com
I've restarted taskcluster-login dynos, to see if this helps.
Assignee: nobody → dustin
It looks like restarting taskcluster-login seems to have solved the issue for now - I am able to log in and view my credentials again. \o/
(Note, this wasn't just me having the problem, it was all users)
Those logs contain codes which could be used to impersonate the given users. The codes will have expired by now, so no sense marking the bug secret, but please be careful! Jabba, was there an issue with the LDAP system that might have caused this "InsufficientAccessRightsError"? It looks like Auth0 was returning errors at a similar time (the traceback in Strategy.OAuth2Strategy.parseErrorResponse).
Flags: needinfo?(dustin) → needinfo?(jdow)
Over the last week, we've seen Mozillians bomb out, and what looks like a similar error here, and maybe LDAP issues? We may need to think of ways to make the login service more resilient to such issues. Perhaps caching results from LDAP and/or Mozillians in a shared key/value store of some sort, and falling back to that when the service fails? That has some obvious security implications, of course..
I don't think LDAP issues are present here. The only time Auth0 talks to LDAP is when you type in the password. If Auth0 is re-authing a user (the screen where you just click your username to log in, but don't need to type in a password), it's done solely within Auth0. I'm not sure if real-time calls to mozillians.org are made, which could be an issue, or if auth0 itself is giving us issues (but I haven't heard of any issues with other sites using Auth0 during this time).
Flags: needinfo?(jdow)
I enabled LDAPClient in DEBUG for this service. A few things to note: * We have support for continually re-binding on the same connection, from back when we would bind as a user to check the password (pre-Okta). We no longer bind as anything but the bind user, but we still do it before each LDAP query. * Queries from a single webhead are serialized (and we only have one webhead) * We retry LDAP "operations" five times. The additional debug logging should show those retries.
Jabba can add detail about what's going on in the LDAP systems, but from my perspective it looks like the fix is to drop LDAP connections when an error occurs (during the retries).
I talked to :dustin on IRC a bit, and with his help, I was able to take a closer look (my earlier comment was incorrect, I misunderstood how taskcluster gets LDAP data). It turns out, it is LDAP's fault after all. The LDAP server in use here is an LDAP proxy, which contains no data of its own, but rather passes LDAP queries back to our datacenter LDAP servers. This generally works fine and hasn't given us any problems over the years, but we seem to have hit some sort of edge case and strange behavior: Taskcluster makes a connection to proxy. Proxy makes a connection to LDAP. Both connections are long-lived and have no timeout. When Taskcluster binds, it sends the bind request to the proxy, the proxy forwards it to the ldap server, and it's pretty transparent and everything works out. At some point on Friday, the connection between the proxy and the ldap server was interrupted, and it reconnected. This shouldn't be a problem, but it seems that that wedged the connection between Taskcluster and the LDAP server (via the proxy) in a way that caused subsequent queries to still be passed through, but without the bind credentials, so the backend LDAP server was seeing all the queries being done anonymously, hence returning err=50 (InsufficientAccessRights). Interestingly the proxy logs still showed the BIND on every single query, but the backend LDAP server didn't. Right when it started happening, there was one message in the proxy's log: Oct 6 04:43:44 ldapproxy2.dmz.scl3.mozilla.com slapd[28982]: conn=5449195 op=31 ldap_back_retry: retrying URI="ldap://ldap.db.scl3.mozilla.com" DN="uid=bind-taskcluster-login,ou=logins,dc=mozilla" Oct 6 04:43:44 ldapproxy2.dmz.scl3.mozilla.com slapd[28982]: conn=5449195 op=31 ldap_back_dobind_int: DN="uid=bind-taskcluster-login,ou=logins,dc=mozilla" without creds, binding anonymously which basically says exactly what it's doing, which gave me the clue. I cannot understand *why* it does this, instead of just outright failing. It very much seems like this is an edge case in the back-ldap code that wasn't tested well. I also see no reason why anyone would want this behavior. As to root cause, the only thing I can find is that there are about 6 seconds where no logs were coming into the backend ldap servers (the load balancers send a ping at least once per second for health checks), so I'm guessing we had some short network blip, or load balancer issue, that otherwise went unnoticed, but it caused this proxied connection to fail in a weird way. I don't see any way of avoiding this. The proper thing would be to connect directly to an LDAP server and not the proxy, but I don't have a way to offer that service at the moment, realistically. My suggestion here, if possible, would be to patch the taskcluster code to reconnect on error, or possibly build a new connection for every authentication attempt, rather than rebinding with the same connection over and over again, as unfortunately I cannot promise a timeline on when we'll have better LDAP infrastructure to be able to get rid of the proxies, which are at fault here.
Deployed PR#65.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
This happened again (but just once)..
Happening a lot..
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Bug 1427866 might refactor thigns so that we don't query LDAP or Mozillians anymore, which would help out quite a bit..
Depends on: 1427866
Assignee: dustin → nobody
Is this still valid?
Component: Authentication → Login
Flags: needinfo?(dustin)
Nope
Status: REOPENED → RESOLVED
Closed: 8 years ago7 years ago
Flags: needinfo?(dustin)
Resolution: --- → FIXED
Component: Login → Services
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: