Closed Bug 1039076 Opened 10 years ago Closed 10 years ago

AWS builds and test jobs failing with abort: error: _ssl.c:510: EOF occurred in violation of protocol among other failures

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86_64 / Windows 8.1
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: KWierso, Unassigned)

Details

Trunk trees and try are seeing this; I'm not sure what else is affected. I've closed the trunk trees and aurora/beta until we know what's going on.


https://tbpl.mozilla.org/php/getParsedLog.php?id=43875605&tree=Mozilla-Inbound is one example

It's possible that bug 1031085's syntax error that was later backed out caused this, and should now be fixed by the backout?
What I see in a network capture is the server closing the session just as the client sends its SSL hello, and it doesn't even do a proper closing handshake: it sends a RST in response to the client's ACK and FIN,ACK.

HTTPS is not the only thing failing; HTTP fails too, but differently. The server sends a "proper" reply, but it's a 403 error.

However, it seems to be routing-related, because it happens reliably on two non-build-slave AWS instances I have, and reliably does *not* happen from home (Japan) or from a server I have in France.
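
(Editorial note: the following is a minimal diagnostic sketch, not part of the original report. It assumes a current Python 3 standard library and uses the hg.mozilla.org hostname from this bug; the RST-on-hello behaviour described above surfaces as an SSL EOF / connection-reset error, while the plain-HTTP path returns the 403.)

# Hypothetical diagnostic sketch (not from the bug): probe hg.mozilla.org over
# HTTPS and plain HTTP from a suspect host to see which failure mode we get.
import socket
import ssl
import urllib.request
from urllib.error import HTTPError, URLError

HOST = "hg.mozilla.org"

def probe_tls(host, port=443, timeout=10):
    """Attempt a bare TLS handshake; a server RST on the client hello
    surfaces here as an SSL EOF / connection-reset error."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                print("TLS handshake OK:", tls.version())
    except (ssl.SSLError, OSError) as exc:
        print("TLS handshake failed:", exc)

def probe_http(host, timeout=10):
    """Plain-HTTP request; in this incident it returned a 403 instead of
    resetting the connection."""
    try:
        with urllib.request.urlopen("http://%s/" % host, timeout=timeout) as resp:
            print("HTTP status:", resp.status)
    except HTTPError as exc:
        print("HTTP error status:", exc.code)
    except URLError as exc:
        print("HTTP request failed:", exc.reason)

if __name__ == "__main__":
    probe_tls(HOST)
    probe_http(HOST)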
Common form of the error:
hg clone https://hg.mozilla.org/build/mozharness scripts
abort: error: _ssl.c:504: EOF occurred in violation of protocol

hg clone https://hg.mozilla.org/build/tools tools
abort: error: _ssl.c:504: EOF occurred in violation of protocol

This seems to be all Linux, and not Windows or Mac (based on TBPL), which means it might be AWS. We're looking for examples on Linux scl3 hardware.
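
(Editorial aside, not from the original bug: the abort messages above are distinctive enough that a wrapper around hg could treat them as transient and retry rather than failing the job outright. A rough sketch, assuming hg is on PATH; the retry policy and marker strings are illustrative and not taken from mozharness or buildbot.)

# Hypothetical retry wrapper around "hg clone" for transient TLS failures.
# The marker strings come from the aborts pasted above; everything else is
# illustrative and not taken from mozharness.
import subprocess
import time

TRANSIENT_MARKERS = (
    "EOF occurred in violation of protocol",
    "Connection reset by peer",
)

def clone_with_retry(url, dest, attempts=3, delay=30):
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(
            ["hg", "clone", url, dest],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            return True
        output = proc.stdout + proc.stderr
        if not any(marker in output for marker in TRANSIENT_MARKERS):
            # Looks like a real error, not the infra hiccup above.
            raise RuntimeError("hg clone failed:\n" + output)
        print("Transient TLS failure (attempt %d/%d), retrying in %ds"
              % (attempt, attempts, delay))
        time.sleep(delay)
    return False

# Example:
# clone_with_retry("https://hg.mozilla.org/build/mozharness", "scripts")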

(In reply to Wes Kocher (:KWierso) from comment #0)
> It's possible that bug 1031085's syntax error that was later backed out
> caused this, and should now be fixed by the backout?

No, it never got merged to production.
No longer depends on: 1031085
'Something' happened at about 2000 Pacific which 'fixed' the problem.

rbryce was looking at Zeus, info of note:
<rbryce>	nthomas I'm seeing this in the zeuslog..
		vservers/hg.mozilla.org-ssl	maxclientbufferdrop	Dropped connection from 54.191.162.99 to port 443, request exceeded max_client_buffer limit (131072 bytes)
<rbryce>	thats an aws ip address
<nthomas>	hmm, so we're sending too much data, or zeus isn't keeping up ?
<rbryce>	I just pulled up the protection class, and it looks like that might be the case
<nthomas>	only a single IP for hg.m.o, so only a single zeus in play right ?
<rbryce>	I believe so
<nthomas>	wonders if the b/w is hitting the limit for the s/w and nic
<rbryce>	so we are only allowing 30 simultaneous connections per IP
<rbryce>	with a max of 200 cons from the 10 busiest IP addresses
<nthomas>	I'd be surprised if we were hitting that, should be one IP per AWS instance
<nthomas>	unless hg is parallelising a lot

At 2034:
<rbryce>	nthomas on a hunch, I found a failed node in the ftp pool. I just removed it
<nthomas>	ftp2 ?
<rbryce>	it is ftp8.dmz.scl3.mozilla.com.

I tried removing the IGW routing rule for hg.m.o from rtb-d77190b2 (testers in us-east-1), between 2017 and 2107.
And at 2107:
<rbryce>	nthomas glandium as a test, I just added the aws ips I see in the zeus log to the exception list
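
(Editorial sketch for context on the routing test mentioned above, not what was actually run in 2014: inspecting and removing a destination in a VPC route table can be scripted with boto3. The route table ID comes from the comment above; the hg.m.o destination CIDR shown is a placeholder, since the real prefix isn't given in this bug.)

# Sketch of inspecting and temporarily removing a route from the testers'
# route table (rtb-d77190b2 in the comment above). boto3 is a modern stand-in
# for whatever tooling was actually used; HG_CIDR is a placeholder value.
import boto3

ROUTE_TABLE_ID = "rtb-d77190b2"
HG_CIDR = "203.0.113.0/24"  # placeholder; the real hg.m.o prefix isn't in this bug

ec2 = boto3.client("ec2", region_name="us-east-1")

def show_routes(rtb_id):
    tables = ec2.describe_route_tables(RouteTableIds=[rtb_id])["RouteTables"]
    for route in tables[0]["Routes"]:
        print(route.get("DestinationCidrBlock"),
              route.get("GatewayId"), route.get("State"))

def drop_route(rtb_id, cidr):
    # Removes the IGW route for the given destination; traffic then falls
    # back to whatever less-specific route remains (e.g. the VPN tunnel).
    ec2.delete_route(RouteTableId=rtb_id, DestinationCidrBlock=cidr)

show_routes(ROUTE_TABLE_ID)
# drop_route(ROUTE_TABLE_ID, HG_CIDR)   # the test tried between 2017 and 2107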
Summary: Builds and test jobs failing with abort: error: _ssl.c:510: EOF occurred in violation of protocol among other failures → AWS builds and test jobs failing with abort: error: _ssl.c:510: EOF occurred in violation of protocol among other failures
There's still some backlog from buildbot-master78 and 79 in getting finished builds onto TBPL, but those are both try masters, and try wasn't closed anyway. I've reopened the rest of the trees.
OK, going through the zeus logs: I caused this with an IP block that was too wide. I thought I was blocking traffic from Merck Pharma; I just put two and two together that a number of AWS IPs were in this range. I have fixed the error and whitelisted the AWS IPs that were being blocked. My sincerest apologies all around for raising your pulse and closing the trees.
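
(Editorial aside, not from this bug: this class of mistake can be caught before a block ships by checking the proposed CIDR against the address ranges AWS publishes at https://ip-ranges.amazonaws.com/ip-ranges.json. A sketch with a placeholder block CIDR follows.)

# Sketch: warn if a proposed firewall/ZLB block overlaps published AWS ranges.
# The proposed CIDR is a placeholder; the ip-ranges.json URL is AWS's public feed.
import ipaddress
import json
import urllib.request

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"
PROPOSED_BLOCK = ipaddress.ip_network("54.191.0.0/16")  # placeholder

with urllib.request.urlopen(AWS_RANGES_URL) as resp:
    prefixes = json.load(resp)["prefixes"]

overlaps = [
    p for p in prefixes
    if p["service"] == "EC2"
    and PROPOSED_BLOCK.overlaps(ipaddress.ip_network(p["ip_prefix"]))
]

if overlaps:
    print("Proposed block overlaps %d EC2 prefixes, e.g. %s (%s)"
          % (len(overlaps), overlaps[0]["ip_prefix"], overlaps[0]["region"]))
else:
    print("No overlap with published EC2 ranges.")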
Great to know what the problem was. Do you think we might have resolved bug 1036176 at the same time? And did the block apply to just the one site, or to all sites behind our scl3 ZLBs?

Background - our AWS instances have 'public IPs' so they can reach hg.m.o/git.m.o/ftp-ssl.m.o over the internet, which takes a lot of load off the VPN tunnel we have to scl3. We have no control over which IP we get; AWS determines that at launch time. We have a record of the known IP blocks at
  http://hg.mozilla.org/build/cloud-tools/file/default/configs/routingtables.yml#l4
(we mainly use this list to talk to S3; the instance IPs probably come from some subset of it).
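
(Editorial sketch to make that last point concrete; the CIDRs below are placeholders, since the actual contents of routingtables.yml aren't reproduced in this bug. Checking which known block a slave's public IP falls in needs only the standard ipaddress module.)

# Sketch: check which (if any) known AWS block a slave's public IP falls in.
# KNOWN_BLOCKS would come from routingtables.yml; the values here are placeholders.
import ipaddress

KNOWN_BLOCKS = [
    "54.191.0.0/16",   # placeholder entries, not the real file contents
    "54.245.0.0/16",
]

def find_block(ip_str, blocks=KNOWN_BLOCKS):
    addr = ipaddress.ip_address(ip_str)
    for cidr in blocks:
        if addr in ipaddress.ip_network(cidr):
            return cidr
    return None

print(find_block("54.191.162.99"))  # the IP seen in the zeus log above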
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(rbryce)
Resolution: --- → FIXED
(In reply to Nick Thomas [:nthomas] from comment #7)
> Great to know what the problem was. Do you think we might have resolved bug
> 1036176 at the same time? And did the block apply to just the one site, or to
> all sites behind our scl3 ZLBs?

It applied to the hg site only.
Flags: needinfo?(rbryce)
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard