1324779 - Balrog Admin reporting SSL errors

Assignee

Description

•

7 years ago

Post mortem for balrog-admin SSL errors: 

At about 9:45am UTC the balrog-admin server started logging errors like: 

> 2016/12/20 09:46:55 [error] 4298#0: *1329686 http_auth_ldap: Authentication timed out, client: ..., \
>   server: , request: "GET /api/releases/Thunderbird-comm-aurora-nightly-20161220004022 \
>   HTTP/1.1", host: "aus4-admin.mozilla.org"

This started causing problems for nightly submission around 14:00UTC.
After digging down and verifying that the LDAP connection was fine I restarted the nginx process on the admin server. 

This seems to have fixed the issues. It appears that the nginx ldap module is not 100% stable and there is a bug that causes this to happen. It's been about 17 days since we deployed the server so the problem does take some time to manifest itself. 

Creating this bug for posterity and to refer back to if this problem happens again.

Benson Wong [:mostlygeek]

Assignee

Updated

•

7 years ago

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Carsten Book [:Tomcat]

Comment 2

•

7 years ago

seems not fixed as some nightlys hit this again https://treeherder.mozilla.org/logviewer.html#?job_id=4506410&repo=mozilla-aurora

Status: RESOLVED → REOPENED

Flags: needinfo?(bwong)

Resolution: FIXED → ---

Joel Maher ( :jmaher ) (UTC -8)

Comment 3

•

7 years ago

and on m-c

Carsten Book [:Tomcat]

Comment 4

•

7 years ago

setting to blocker since we got now nightly user complaints about not getting updated

Severity: normal → blocker

Benson Wong [:mostlygeek]

Assignee

Comment 5

•

7 years ago

See lots of errors like: 

http_auth_ldap: Authentication timed out, client: x.x.x.x, server: , request: "GET /api/releases/Firefox-mozilla-central-nightly-20161221030226 HTTP/1.1", host: "aus4-admin.mozilla.org"

restarting nginx seems to have resolved the issue. 
The nginx-auth-ldap we are running is commit [1] 49a8b4d28fc4a518563c82e0b52821e5f37db1fc which is fairly recently. 

[1] https://github.com/kvspb/nginx-auth-ldap/commit/49a8b4d28fc4a518563c82e0b52821e5f37db1fc

Flags: needinfo?(bwong)

Joel Maher ( :jmaher ) (UTC -8)

Comment 6

•

7 years ago

can we retrigger all the nightlies and verify this works, I didn't see indication on treeherder of that being done yesterday.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 7

•

7 years ago

Can we have a Nagios check for that particular issue? It's kinda bad that we have to wait for the next Nightly builds to figure out that this issue is still persistent.

Virtual_ManPL [:Virtual] 🇵🇱 - (please needinfo? me - so I will see your comment/reply/question/etc.)

Comment 8

•

7 years ago

FYI - My Nightly (2016-12-19) just updated to (2016-12-21).

Jordan Lund (:jlund)

Updated

•

7 years ago

Comment 9

•

7 years ago

(In reply to Henrik Skupin (:whimboo) from comment #7)
> Can we have a Nagios check for that particular issue? It's kinda bad that we
> have to wait for the next Nightly builds to figure out that this issue is
> still persistent.

I'm not sure if this is what we want, but I do think it's important to take some action to prevent this or shorten the window before tomorrow's nightlies. I'll be talking with Benson later today about it.

Benson Wong [:mostlygeek]

Assignee

Comment 10

•

7 years ago

I enabled debug logging for nginx errors. So if this happens again we can have more detailed information on why it is happening. 

The nginx logs show: 

09:12:41 "GET ... HTTP/1.1" 000 352 0 tbirdbld 10.009 - "python-requests/2.7.0 CPython/2.6.6 Linux/2.6.32-504.3.3.el6.x86_64"
09:12:54 "GET ... HTTP/1.1" 000 352 0 tbirdbld 10.002 - "python-requests/2.7.0 CPython/2.6.6 Linux/2.6.32-504.3.3.el6.x86_64"
09:12:06 "GET ... HTTP/1.1" 000 352 0 tbirdbld 10.005 - "python-requests/2.7.0 CPython/2.6.6 Linux/2.6.32-504.3.3.el6.x86_64"
09:12:27 "GET ... HTTP/1.1" 000 352 0 tbirdbld 10.000 - "python-requests/2.7.0 CPython/2.6.6 Linux/2.6.32-504.3.3.el6.x86_64"

These requests are timing out after ~10 seconds. Which coincide with the error message for Authentication timeout. The way the LDAP auth works is: 

  nginx -> ELB (internal) ldaps -> ldap proxy -> (ssl client connection) -> ldap.mozilla.org

LDAP requests go through an internal ELB to an stunnel process which proxies requests to ldap.mozilla.org through an SSL tunnel. When nightlies can't be submitted the stunnel logs show [1]:

Dec 21 06:57:45 > Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
Dec 21 06:57:52 > Accepted LDAP connection
Dec 21 06:57:52 > Disconnected
Dec 21 06:57:52 > Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
Dec 21 06:58:45 > Accepted LDAP connection
Dec 21 06:58:45 > Disconnected
Dec 21 06:58:45 > Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
Dec 21 06:58:52 > Accepted LDAP connection
Dec 21 06:58:52 > Disconnected
Dec 21 06:58:52 > Connection reset: 0 byte(s) sent to SSL, 0 byte(s) sent to socket
Dec 21 06:59:45 > Accepted LDAP connection
Dec 21 06:59:45 > Disconnected

This seems to indicate that nginx made an SSL connection to the ELB and then immediately dropped the connection. However, that signal does not make it back to the ldap auth module so it eventually times out. Restarting nginx fixes this issue because it drops all current connections. This seems to indicate that there is some sort of resource exhaustion or something getting stuck and eventually everything falls off a cliff.



[1] command: journalctl --since '2016-12-21 06:20' --until '2016-12-21 07:25' -u stunnel_ldap.service |  sed 's/ip-.*[0-9]*\]:/>/' | sed 's/connect_blocking: connected 63.245.215.32:6363/connect to ldap.mozilla.org/' | sed 's/Service.*bridge.*accepted connection from 172.*$/Accepted LDAP connection/' | sed 's/SSL_accept: Peer suddenly disconnected/Disconnected/'

bhearsum@mozilla.com (:bhearsum)

Comment 11

•

7 years ago

Benson and I talked about this some more and we think that increasing the length of time that nginx caches auth may reduce the possibility of hitting the resource exhaustion theorized in comment #10. It was set to 30s, and he's increased it to 5min. Hopefully this will make it better for the next set of nightlies.

Benson also came up with the idea of some sort of watchdog that will look for the connection resets in the stunnel logs, and restart nginx if it sees them. This is pretty hacky though, so we don't want to go down that path right away.

This always seems to happen around midnight pacific, and both of us plan to be around at that time tonight to do some additional debugging if it happens again (and to fix it more quickly).

Comment hidden (Intermittent Failures Robot)

65 failures in 123 pushes (0.528 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* mozilla-aurora: 36
* mozilla-central: 29

Platform breakdown:
* osx-10-10: 16
* linux64: 16
* linux32: 14
* windows8-32: 6
* windows8-64: 4
* osx-10-7: 2
* android-4-3-armv7-api15: 2
* android-4-2-x86: 2
* android-4-0-armv7-api15: 2
* windowsxp: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324779&startday=2016-12-21&endday=2016-12-21&tree=all

Pascal Chevrel:pascalc (PTO until April 26)

Comment 13

•

7 years ago

Is this bug the reason why no updates are served to Nightly?
Empty xml:
https://aus5.mozilla.org/update/6/Firefox/53.0a1/20161220030215/Linux_x86_64-gcc3/fr/nightly/Linux%204.4.0-53-generic%20(GTK%203.18.9%2Clibpulse%208.0.0)/NA/default/default/update.xml?force=1

Yet there are builds on FTP for the 21 and some for the 22:
http://ftp.mozilla.org/pub/firefox/nightly/latest-mozilla-central-l10n/

Jordan Lund (:jlund)

Comment 14

•

7 years ago

(In reply to Pascal Chevrel:pascalc from comment #13)
> Is this bug the reason why no updates are served to Nightly?
> Empty xml:
> https://aus5.mozilla.org/update/6/Firefox/53.0a1/20161220030215/Linux_x86_64-
> gcc3/fr/nightly/Linux%204.4.0-53-generic%20(GTK%203.18.9%2Clibpulse%208.0.0)/
> NA/default/default/update.xml?force=1
> 
> Yet there are builds on FTP for the 21 and some for the 22:
> http://ftp.mozilla.org/pub/firefox/nightly/latest-mozilla-central-l10n/

nightlies are no longer frozen up to dec 19th nightlies (Bug 1325115).
Assuming this bug isn't blocking, you should receive newer nightlies now

Pascal Chevrel:pascalc (PTO until April 26)

Comment 15

•

7 years ago

(In reply to Jordan Lund (:jlund) from comment #14)
> nightlies are no longer frozen up to dec 19th nightlies (Bug 1325115).
> Assuming this bug isn't blocking, you should receive newer nightlies now

I just received and update (20->22).

Benson Wong [:mostlygeek]

Assignee

Comment 16

•

7 years ago

Last night's run completed successfully. It appears the longer ldap cache time seems to have worked around the issue. I did some log crunching for logins per 30 seconds [1] and 300 seconds [2]: 

-  30s cache: 545 LDAP requests
- 300s cache: 157 LDAP requests

So the increase to a 5min cache is about a 70% reduction in LDAP requests.

Using today's data, cache estimates for longer TTL:

- 10m cache: 99 requests, ~80% reduction
- 15m cache: 76 requests, ~86% reduction
- 30m cache: 43 requests, ~92% reduction

The balrog deployment (bug 1324850) today makes the 5min cache standard.

[1] gawk '{printf "%s %s\n",strftime("%H:%M:%S",int($1/30)*30),$9}' balrog.access.log  | grep -v @mozilla | grep -v '-' | sort | uniq -c | wc -l
[2] gawk '{printf "%s %s\n",strftime("%H:%M",int($1/300)*300),$9}' balrog.access.log  | grep -v @mozilla | grep -v '-' | sort | uniq -c | wc -l

bhearsum@mozilla.com (:bhearsum)

Comment 17

•

7 years ago

In the medium/long term, I expect we'll be changing Balrog auth to Okta or something else that supports MFA. In that world, LDAP should be out of the loop as far as nginx is concerned...

bhearsum@mozilla.com (:bhearsum)

Comment 18

•

7 years ago

Things look fine this morning too, so the cache fix appears to be holding. Just in case things go haywire again while people are off for holidays, I want to make note of escalation paths here.

For RelEng, CloudOps escalation is documented on https://mana.mozilla.org/wiki/display/SVCOPS/Contacting+Cloud+Operations.

For others, please use the normal RelEng escalation path - it's important that we don't page CloudOps without investigating on our own first. That's documented here: https://wiki.mozilla.org/ReleaseEngineering#Contacting_Release_Engineering

Comment hidden (Intermittent Failures Robot)

85 failures in 609 pushes (0.14 failures/push) were associated with this bug in the last 7 days. 

This is the #3 most frequent failure this week. 

** This failure happened more than 50 times this week! Resolving this bug is a high priority. **

Repository breakdown:
* mozilla-central: 49
* mozilla-aurora: 36

Platform breakdown:
* osx-10-10: 22
* linux64: 22
* linux32: 17
* windows8-32: 6
* windows8-64: 5
* osx-10-7: 3
* android-4-2-x86: 3
* android-4-0-armv7-api15: 3
* windowsxp: 2
* android-4-3-armv7-api15: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324779&startday=2016-12-19&endday=2016-12-25&tree=all

bhearsum@mozilla.com (:bhearsum)

Comment 20

•

7 years ago

Looks like we haven't hit this in ~1w, which seems like a good sign.

Benson Wong [:mostlygeek]

Assignee

Comment 21

•

7 years ago

Marking as fixed.

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

4 years ago

Product: Release Engineering → Release Engineering Graveyard

Bugzilla

Quick Search

Balrog Admin reporting SSL errors

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)

Tracking

(Not tracked)

People

(Reporter: mostlygeek, Assigned: mostlygeek)

References

Details

Crash Data

Security

(public)

User Story

Description

Updated

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Updated

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Comment 15

Comment 16

Comment 17

Comment 18

Comment 19

Comment 20

Comment 21

Updated