Closed Bug 1039076 Opened 10 years ago Closed 10 years ago

AWS builds and test jobs failing with abort: error: _ssl.c:510: EOF occurred in violation of protocol among other failures

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86_64 / Windows 8.1
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: KWierso, Unassigned)

Details

Trunk trees and try are seeing this; I'm not sure what else is affected. I've closed the trunk trees and aurora/beta until we know what's going on.


https://tbpl.mozilla.org/php/getParsedLog.php?id=43875605&tree=Mozilla-Inbound is one example

It's possible that bug 1031085's syntax error that was later backed out caused this, and should now be fixed by the backout?
What I see in a network capture is the server closing the session just as the client sends its SSL hello, and it doesn't even do a proper closing handshake: it sends a RST in response to the client's ACK and FIN,ACK.

HTTPS is not the only thing failing; HTTP fails too, but differently. The server sends a "proper" reply, but it's a 403 error.

However, it seems to be routing-related, because it happens reliably on two non-build-slave AWS instances I have, and reliably does *not* happen from home (Japan) or from a server I have in France.
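
(Editorial note: the following is a minimal diagnostic sketch, not part of the original report. It assumes a current Python 3 standard library and uses the hg.mozilla.org hostname from this bug; the RST-on-hello behaviour described above surfaces as an SSL EOF / connection-reset error, while the plain-HTTP path returns the 403.)

# Hypothetical diagnostic sketch (not from the bug): probe hg.mozilla.org over
# HTTPS and plain HTTP from a suspect host to see which failure mode we get.
import socket
import ssl
import urllib.request
from urllib.error import HTTPError, URLError

HOST = "hg.mozilla.org"

def probe_tls(host, port=443, timeout=10):
    """Attempt a bare TLS handshake; a server RST on the client hello
    surfaces here as an SSL EOF / connection-reset error."""
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                print("TLS handshake OK:", tls.version())
    except (ssl.SSLError, OSError) as exc:
        print("TLS handshake failed:", exc)

def probe_http(host, timeout=10):
    """Plain-HTTP request; in this incident it returned a 403 instead of
    resetting the connection."""
    try:
        with urllib.request.urlopen("http://%s/" % host, timeout=timeout) as resp:
            print("HTTP status:", resp.status)
    except HTTPError as exc:
        print("HTTP error status:", exc.code)
    except URLError as exc:
        print("HTTP request failed:", exc.reason)

if __name__ == "__main__":
    probe_tls(HOST)
    probe_http(HOST)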
Common form of the error:
hg clone https://hg.mozilla.org/build/mozharness scripts
abort: error: _ssl.c:504: EOF occurred in violation of protocol

hg clone https://hg.mozilla.org/build/tools tools
abort: error: _ssl.c:504: EOF occurred in violation of protocol

This seems to be all Linux, and not Windows or Mac (based on TBPL), which means it might be AWS. We're looking for examples on Linux scl3 hardware.
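
(Editorial aside, not from the original bug: the abort messages above are distinctive enough that a wrapper around hg could treat them as transient and retry rather than failing the job outright. A rough sketch, assuming hg is on PATH; the retry policy and marker strings are illustrative and not taken from mozharness or buildbot.)

# Hypothetical retry wrapper around "hg clone" for transient TLS failures.
# The marker strings come from the aborts pasted above; everything else is
# illustrative and not taken from mozharness.
import subprocess
import time

TRANSIENT_MARKERS = (
    "EOF occurred in violation of protocol",
    "Connection reset by peer",
)

def clone_with_retry(url, dest, attempts=3, delay=30):
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(
            ["hg", "clone", url, dest],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            return True
        output = proc.stdout + proc.stderr
        if not any(marker in output for marker in TRANSIENT_MARKERS):
            # Looks like a real error, not the infra hiccup above.
            raise RuntimeError("hg clone failed:\n" + output)
        print("Transient TLS failure (attempt %d/%d), retrying in %ds"
              % (attempt, attempts, delay))
        time.sleep(delay)
    return False

# Example:
# clone_with_retry("https://hg.mozilla.org/build/mozharness", "scripts")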

(In reply to Wes Kocher (:KWierso) from comment #0)
> It's possible that bug 1031085's syntax error that was later backed out
> caused this, and should now be fixed by the backout?

No, it never got merged to production.
No longer depends on: 1031085
'Something' happened at about 2000 Pacific which 'fixed' the problem.

rbryce was looking at Zeus, info of note:
<rbryce>	nthomas I'm seeing this in the zeuslog..
		vservers/hg.mozilla.org-ssl	maxclientbufferdrop	Dropped connection from 54.191.162.99 to port 443, request exceeded max_client_buffer limit (131072 bytes)
<rbryce>	thats an aws ip address
<nthomas>	hmm, so we're sending too much data, or zeus isn't keeping up ?
<rbryce>	I just pulled up the protection class, and it looks like that might be the case
<nthomas>	only a single IP for hg.m.o, so only a single zeus in play right ?
<rbryce>	I believe so
<nthomas>	wonders if the b/w is hitting the limit for the s/w and nic
<rbryce>	so we are only allowing 30 simultaneous connections per IP
<rbryce>	with a max of 200 cons from the 10 busiest IP addresses
<nthomas>	I'd be surprised if we were hitting that, should be one IP per AWS instance
<nthomas>	unless hg is parallelising a lot

At 2034:
<rbryce>	nthomas on a hunch, I found a failed node in the ftp pool. I just removed it
<nthomas>	ftp2 ?
<rbryce>	it is ftp8.dmz.scl3.mozilla.com.

I tried removing the IGW routing rule for hg.m.o from rtb-d77190b2 (testers in us-east-1), between 2017 and 2107.
And at 2107:
<rbryce>	nthomas glandium as a test, I just added the aws ips I see in the zeus log to the exception list
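
(Editorial sketch for context on the routing test mentioned above, not what was actually run in 2014: inspecting and removing a destination in a VPC route table can be scripted with boto3. The route table ID comes from the comment above; the hg.m.o destination CIDR shown is a placeholder, since the real prefix isn't given in this bug.)

# Sketch of inspecting and temporarily removing a route from the testers'
# route table (rtb-d77190b2 in the comment above). boto3 is a modern stand-in
# for whatever tooling was actually used; HG_CIDR is a placeholder value.
import boto3

ROUTE_TABLE_ID = "rtb-d77190b2"
HG_CIDR = "203.0.113.0/24"  # placeholder; the real hg.m.o prefix isn't in this bug

ec2 = boto3.client("ec2", region_name="us-east-1")

def show_routes(rtb_id):
    tables = ec2.describe_route_tables(RouteTableIds=[rtb_id])["RouteTables"]
    for route in tables[0]["Routes"]:
        print(route.get("DestinationCidrBlock"),
              route.get("GatewayId"), route.get("State"))

def drop_route(rtb_id, cidr):
    # Removes the IGW route for the given destination; traffic then falls
    # back to whatever less-specific route remains (e.g. the VPN tunnel).
    ec2.delete_route(RouteTableId=rtb_id, DestinationCidrBlock=cidr)

show_routes(ROUTE_TABLE_ID)
# drop_route(ROUTE_TABLE_ID, HG_CIDR)   # the test tried between 2017 and 2107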
Summary: Builds and test jobs failing with abort: error: _ssl.c:510: EOF occurred in violation of protocol among other failures → AWS builds and test jobs failing with abort: error: _ssl.c:510: EOF occurred in violation of protocol among other failures
There's still some backlog from buildbot-master78 and 79 in getting finished builds onto TBPL, but those are both try masters, and try wasn't closed anyway. I've reopened the rest of the trees.
OK, going through the zeus logs: I caused this with an IP block that was too wide. I thought I was blocking traffic from Merck Pharma; I just put two and two together that a number of AWS IPs were in this range. I have fixed the error and whitelisted the AWS IPs that were being blocked. My sincerest apologies all around for raising your pulse and closing the trees.
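
(Editorial aside, not from this bug: this class of mistake can be caught before a block ships by checking the proposed CIDR against the address ranges AWS publishes at https://ip-ranges.amazonaws.com/ip-ranges.json. A sketch with a placeholder block CIDR follows.)

# Sketch: warn if a proposed firewall/ZLB block overlaps published AWS ranges.
# The proposed CIDR is a placeholder; the ip-ranges.json URL is AWS's public feed.
import ipaddress
import json
import urllib.request

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"
PROPOSED_BLOCK = ipaddress.ip_network("54.191.0.0/16")  # placeholder

with urllib.request.urlopen(AWS_RANGES_URL) as resp:
    prefixes = json.load(resp)["prefixes"]

overlaps = [
    p for p in prefixes
    if p["service"] == "EC2"
    and PROPOSED_BLOCK.overlaps(ipaddress.ip_network(p["ip_prefix"]))
]

if overlaps:
    print("Proposed block overlaps %d EC2 prefixes, e.g. %s (%s)"
          % (len(overlaps), overlaps[0]["ip_prefix"], overlaps[0]["region"]))
else:
    print("No overlap with published EC2 ranges.")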
Great to know what the problem was. Do you think we might have resolved bug 1036176 at the same time? And did the block apply to just the one site, or to all sites behind our scl3 ZLBs?

Background - our AWS instances have 'public IPs' so they can reach hg.m.o/git.m.o/ftp-ssl.m.o over the internet, which takes a lot of load off the VPN tunnel we have to scl3. We have no control over which IP we get; AWS determines that at launch time. We have a record of the known IP blocks at
  http://hg.mozilla.org/build/cloud-tools/file/default/configs/routingtables.yml#l4
(we mainly use this list to talk to S3; the instance IPs probably come from some subset of it).
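
(Editorial sketch to make that last point concrete; the CIDRs below are placeholders, since the actual contents of routingtables.yml aren't reproduced in this bug. Checking which known block a slave's public IP falls in needs only the standard ipaddress module.)

# Sketch: check which (if any) known AWS block a slave's public IP falls in.
# KNOWN_BLOCKS would come from routingtables.yml; the values here are placeholders.
import ipaddress

KNOWN_BLOCKS = [
    "54.191.0.0/16",   # placeholder entries, not the real file contents
    "54.245.0.0/16",
]

def find_block(ip_str, blocks=KNOWN_BLOCKS):
    addr = ipaddress.ip_address(ip_str)
    for cidr in blocks:
        if addr in ipaddress.ip_network(cidr):
            return cidr
    return None

print(find_block("54.191.162.99"))  # the IP seen in the zeus log above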
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(rbryce)
Resolution: --- → FIXED
(In reply to Nick Thomas [:nthomas] from comment #7)
> Great to know what the problem was. Do you think we might have resolved bug
> 1036176 at the same time? And did the block apply to just the one site, or to
> all sites behind our scl3 ZLBs?

It applied to the hg site only.
Flags: needinfo?(rbryce)
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard