Balrog Admin reporting SSL errors

RESOLVED FIXED

Status

Product: Release Engineering
Component: Balrog: Backend
Priority: --
Severity: blocker
Status: RESOLVED FIXED
Reported: a year ago
Last modified: a year ago

People

(Reporter: mostlygeek, Assigned: mostlygeek)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Assignee)

Description

a year ago
Post-mortem for balrog-admin SSL errors:

At about 9:45am UTC the balrog-admin server started logging errors like: 

> 2016/12/20 09:46:55 [error] 4298#0: *1329686 http_auth_ldap: Authentication timed out, client: ..., \
>   server: , request: "GET /api/releases/Thunderbird-comm-aurora-nightly-20161220004022 \
>   HTTP/1.1", host: "aus4-admin.mozilla.org"

This started causing problems for nightly submission around 14:00 UTC.
After digging in and verifying that the LDAP connection was fine, I restarted the nginx process on the admin server.

This seems to have fixed the issue. The nginx LDAP module does not appear to be 100% stable, and some bug causes this behavior. It had been about 17 days since we deployed the server, so the problem takes some time to manifest itself.

Creating this bug for posterity and to refer back to if this problem happens again.
(Assignee)

Updated

a year ago
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
Duplicate of this bug: 1324759
Seems not fixed, as some nightlies hit this again: https://treeherder.mozilla.org/logviewer.html#?job_id=4506410&repo=mozilla-aurora
Status: RESOLVED → REOPENED
Flags: needinfo?(bwong)
Resolution: FIXED → ---
and on m-c
Setting to blocker, since we are now getting complaints from Nightly users about not receiving updates.
Severity: normal → blocker
(Assignee)

Comment 5

a year ago
See lots of errors like: 

http_auth_ldap: Authentication timed out, client: x.x.x.x, server: , request: "GET /api/releases/Firefox-mozilla-central-nightly-20161221030226 HTTP/1.1", host: "aus4-admin.mozilla.org"

Restarting nginx seems to have resolved the issue.
The nginx-auth-ldap we are running is commit 49a8b4d28fc4a518563c82e0b52821e5f37db1fc [1], which is fairly recent.

[1] https://github.com/kvspb/nginx-auth-ldap/commit/49a8b4d28fc4a518563c82e0b52821e5f37db1fc
Flags: needinfo?(bwong)
Can we retrigger all the nightlies and verify this works? I didn't see any indication on Treeherder of that being done yesterday.
Can we have a Nagios check for that particular issue? It's kinda bad that we have to wait for the next Nightly builds to figure out that this issue is still persistent.
FYI - My Nightly (2016-12-19) just updated to (2016-12-21).

Updated

a year ago
See Also: → bug 1325115
(In reply to Henrik Skupin (:whimboo) from comment #7)
> Can we have a Nagios check for that particular issue? It's kinda bad that we
> have to wait for the next Nightly builds to figure out that this issue is
> still persistent.

I'm not sure if this is what we want, but I do think it's important to take some action to prevent this or shorten the window before tomorrow's nightlies. I'll be talking with Benson later today about it.
(Assignee)

Comment 10

a year ago
I enabled debug logging for nginx errors, so if this happens again we'll have more detailed information about why it is happening.

The nginx logs show: 

09:12:41 "GET ... HTTP/1.1" 000 352 0 tbirdbld 10.009 - "python-requests/2.7.0 CPython/2.6.6 Linux/2.6.32-504.3.3.el6.x86_64"
09:12:54 "GET ... HTTP/1.1" 000 352 0 tbirdbld 10.002 - "python-requests/2.7.0 CPython/2.6.6 Linux/2.6.32-504.3.3.el6.x86_64"
09:12:06 "GET ... HTTP/1.1" 000 352 0 tbirdbld 10.005 - "python-requests/2.7.0 CPython/2.6.6 Linux/2.6.32-504.3.3.el6.x86_64"
09:12:27 "GET ... HTTP/1.1" 000 352 0 tbirdbld 10.000 - "python-requests/2.7.0 CPython/2.6.6 Linux/2.6.32-504.3.3.el6.x86_64"

These requests are timing out after ~10 seconds, which coincides with the authentication timeout error message. The way the LDAP auth works is:

  nginx -> ELB (internal) ldaps -> ldap proxy -> (ssl client connection) -> ldap.mozilla.org

LDAP requests go through an internal ELB to an stunnel process which proxies requests to ldap.mozilla.org through an SSL tunnel. When nightlies can't be submitted the stunnel logs show [1]:

Dec 21 06:57:45 > Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
Dec 21 06:57:52 > Accepted LDAP connection
Dec 21 06:57:52 > Disconnected
Dec 21 06:57:52 > Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
Dec 21 06:58:45 > Accepted LDAP connection
Dec 21 06:58:45 > Disconnected
Dec 21 06:58:45 > Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
Dec 21 06:58:52 > Accepted LDAP connection
Dec 21 06:58:52 > Disconnected
Dec 21 06:58:52 > Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
Dec 21 06:59:45 > Accepted LDAP connection
Dec 21 06:59:45 > Disconnected

This seems to indicate that nginx made an SSL connection to the ELB and then immediately dropped it. However, that signal does not make it back to the LDAP auth module, so it eventually times out. Restarting nginx fixes the issue because it drops all current connections. This suggests some sort of resource exhaustion, or something getting stuck, until eventually everything falls off a cliff.



[1] command: journalctl --since '2016-12-21 06:20' --until '2016-12-21 07:25' -u stunnel_ldap.service |  sed 's/ip-.*[0-9]*\]:/>/' | sed 's/connect_blocking: connected 63.245.215.32:6363/connect to ldap.mozilla.org/' | sed 's/Service.*bridge.*accepted connection from 172.*$/Accepted LDAP connection/' | sed 's/SSL_accept: Peer suddenly disconnected/Disconnected/'
Benson and I talked about this some more and we think that increasing the length of time that nginx caches auth may reduce the possibility of hitting the resource exhaustion theorized in comment #10. It was set to 30s, and he's increased it to 5min. Hopefully this will make it better for the next set of nightlies.
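For reference, result caching in the kvspb/nginx-auth-ldap module is controlled by a handful of http-level directives. A sketch of what the 5-minute cache would look like (the directive names come from the module's README; the exact values here are illustrative, not the deployed production config):

```nginx
# Illustrative only -- not the production settings.
auth_ldap_cache_enabled on;
# Cache entries expire after 300000 ms (5 minutes), the new TTL.
auth_ldap_cache_expiration_time 300000;
# Maximum number of cached auth results to keep.
auth_ldap_cache_size 1000;
```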

Benson also came up with the idea of some sort of watchdog that will look for the connection resets in the stunnel logs, and restart nginx if it sees them. This is pretty hacky though, so we don't want to go down that path right away.
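A minimal sketch of what such a watchdog might look like (entirely hypothetical: the threshold, the log-matching pattern, and the restart action are assumptions for illustration, not anything that was deployed):

```shell
#!/bin/sh
# Hypothetical watchdog sketch: count stunnel "Connection reset: 0 byte(s)"
# lines in a log excerpt and decide whether nginx should be restarted.

THRESHOLD=5

# check_stunnel_log LOGFILE -> prints "restart" or "ok"
check_stunnel_log() {
    resets=$(grep -c 'Connection reset: 0 byte(s) sent to SSL' "$1")
    if [ "$resets" -gt "$THRESHOLD" ]; then
        # In production this would be something like: systemctl restart nginx
        echo restart
    else
        echo ok
    fi
}
```

In practice this would run from cron every few minutes against a recent log window, e.g. the output of `journalctl --since '-5 min' -u stunnel_ldap.service`.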

This always seems to happen around midnight pacific, and both of us plan to be around at that time tonight to do some additional debugging if it happens again (and to fix it more quickly).
65 failures in 123 pushes (0.528 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* mozilla-aurora: 36
* mozilla-central: 29

Platform breakdown:
* osx-10-10: 16
* linux64: 16
* linux32: 14
* windows8-32: 6
* windows8-64: 4
* osx-10-7: 2
* android-4-3-armv7-api15: 2
* android-4-2-x86: 2
* android-4-0-armv7-api15: 2
* windowsxp: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324779&startday=2016-12-21&endday=2016-12-21&tree=all
(In reply to Pascal Chevrel:pascalc from comment #13)
> Is this bug the reason why no updates are served to Nightly?
> Empty xml:
> https://aus5.mozilla.org/update/6/Firefox/53.0a1/20161220030215/Linux_x86_64-
> gcc3/fr/nightly/Linux%204.4.0-53-generic%20(GTK%203.18.9%2Clibpulse%208.0.0)/
> NA/default/default/update.xml?force=1
> 
> Yet there are builds on FTP for the 21 and some for the 22:
> http://ftp.mozilla.org/pub/firefox/nightly/latest-mozilla-central-l10n/

Nightlies are no longer frozen at the Dec 19th builds (bug 1325115).
Assuming this bug isn't blocking, you should receive newer nightlies now.
(In reply to Jordan Lund (:jlund) from comment #14)
> Nightlies are no longer frozen at the Dec 19th builds (bug 1325115).
> Assuming this bug isn't blocking, you should receive newer nightlies now.

I just received an update (20 → 22).
(Assignee)

Comment 16

a year ago
Last night's run completed successfully; the longer LDAP cache time appears to have worked around the issue. I did some log crunching to count logins per 30-second [1] and 300-second [2] window:

-  30s cache: 545 LDAP requests
- 300s cache: 157 LDAP requests

So the increase to a 5min cache is about a 70% reduction in LDAP requests.

Using today's data, cache estimates for longer TTL:

- 10m cache: 99 requests, ~80% reduction
- 15m cache: 76 requests, ~86% reduction
- 30m cache: 43 requests, ~92% reduction

The balrog deployment (bug 1324850) today makes the 5min cache standard.

[1] gawk '{printf "%s %s\n",strftime("%H:%M:%S",int($1/30)*30),$9}' balrog.access.log  | grep -v @mozilla | grep -v '-' | sort | uniq -c | wc -l
[2] gawk '{printf "%s %s\n",strftime("%H:%M",int($1/300)*300),$9}' balrog.access.log  | grep -v @mozilla | grep -v '-' | sort | uniq -c | wc -l
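The core of those one-liners is just bucketing each authenticated request into a fixed time window and counting distinct (window, user) pairs, which approximates how many LDAP lookups a cache with that TTL must perform. A simplified sketch of the same logic (assumed input layout of "epoch-timestamp user" per line, unlike the real access-log format above):

```shell
# Count distinct (time-window, user) pairs, i.e. the LDAP logins a cache
# with the given TTL would have to perform.
# Assumed input: lines of "<epoch seconds> <username>".
count_ldap_logins() {  # usage: count_ldap_logins WINDOW_SECONDS < logfile
    awk -v w="$1" '{
        # Bucket the timestamp into a w-second window, keyed with the user.
        key = int($1 / w) * w " " $2
        if (!(key in seen)) { seen[key] = 1; n++ }
    } END { print n + 0 }'
}
```

With a larger window, nearby requests from the same user collapse into one bucket, which is exactly the ~70% reduction observed when going from a 30s to a 300s cache.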
In the medium/long term, I expect we'll be changing Balrog auth to Okta or something else that supports MFA. In that world, LDAP should be out of the loop as far as nginx is concerned...
Things look fine this morning too, so the cache fix appears to be holding. Just in case things go haywire again while people are off for holidays, I want to make note of escalation paths here.

For RelEng, CloudOps escalation is documented on https://mana.mozilla.org/wiki/display/SVCOPS/Contacting+Cloud+Operations.

For others, please use the normal RelEng escalation path - it's important that we don't page CloudOps without investigating on our own first. That's documented here: https://wiki.mozilla.org/ReleaseEngineering#Contacting_Release_Engineering
85 failures in 609 pushes (0.14 failures/push) were associated with this bug in the last 7 days. 

This is the #3 most frequent failure this week. 

** This failure happened more than 50 times this week! Resolving this bug is a high priority. **

Repository breakdown:
* mozilla-central: 49
* mozilla-aurora: 36

Platform breakdown:
* osx-10-10: 22
* linux64: 22
* linux32: 17
* windows8-32: 6
* windows8-64: 5
* osx-10-7: 3
* android-4-2-x86: 3
* android-4-0-armv7-api15: 3
* windowsxp: 2
* android-4-3-armv7-api15: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324779&startday=2016-12-19&endday=2016-12-25&tree=all
Looks like we haven't hit this in ~1w, which seems like a good sign.
(Assignee)

Comment 21

a year ago
Marking as fixed.
Status: REOPENED → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED