Closed
Bug 1039076
Opened 10 years ago
Closed 10 years ago
AWS builds and test jobs failing with abort: error: _ssl.c:510: EOF occurred in violation of protocol among other failures
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: KWierso, Unassigned)
Details
Trunk trees and try are seeing this; not sure what else is affected. I've closed trunk trees and aurora/beta until we know what's going on. https://tbpl.mozilla.org/php/getParsedLog.php?id=43875605&tree=Mozilla-Inbound is one example. It's possible that bug 1031085's syntax error, which was later backed out, caused this, and it should now be fixed by the backout?
Comment 1•10 years ago
What I see in a network capture is the server closing the session just as the client sends an SSL hello, and it doesn't even do a proper closing handshake: it sends a RST in response to the ACK and FIN,ACK from the client. HTTPS isn't the only thing failing; plain HTTP fails too, but differently. It sends a "proper" reply, but with a 403 error. However, it seems to be routing related, because it reliably happens on two non-build slave AWS instances I have, and reliably *does not* happen from home (Japan) or from a server I have in France.
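A quick way to reproduce this kind of failure from a suspect host is to open a TCP connection and attempt a TLS handshake, then distinguish a mid-handshake EOF from a reset. This is only an illustrative sketch using the Python standard library; the host name is the one under discussion, everything else is an assumption:

```python
# Probe whether a server completes a TLS handshake or drops the
# connection mid-handshake, as the packet capture above suggests.
# Illustrative sketch only.
import socket
import ssl

def probe_tls(host, port=443, timeout=5):
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return "handshake ok: %s" % tls.version()
    except ssl.SSLEOFError:
        # Server closed the connection mid-handshake: the
        # "EOF occurred in violation of protocol" case.
        return "EOF during handshake"
    except ConnectionResetError:
        # RST instead of a proper closing handshake.
        return "connection reset"
    except (ssl.SSLError, OSError) as exc:
        return "other failure: %s" % exc

if __name__ == "__main__":
    print(probe_tls("hg.mozilla.org"))
```

Running this from an affected AWS instance and from an unaffected host side by side would show whether the failure really is per-route rather than per-server.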
Comment 2•10 years ago
Common form of the error:

hg clone https://hg.mozilla.org/build/mozharness scripts
abort: error: _ssl.c:504: EOF occurred in violation of protocol

hg clone https://hg.mozilla.org/build/tools tools
abort: error: _ssl.c:504: EOF occurred in violation of protocol

Seems to be all Linux, and not Windows or Mac (based on tbpl), which means it might be AWS. We're looking for examples on Linux scl3 hardware.

(In reply to Wes Kocher (:KWierso) from comment #0)
> It's possible that bug 1031085's syntax error that was later backed out
> caused this, and should now be fixed by the backout?

No, it never got merged to production.
Comment 3•10 years ago
'Something' happened at about 2000 Pacific which 'fixed' the problem. rbryce was looking at Zeus; info of note:

<rbryce> nthomas I'm seeing this in the zeuslog.. vservers/hg.mozilla.org-ssl maxclientbufferdrop Dropped connection from 54.191.162.99 to port 443, request exceeded max_client_buffer limit (131072 bytes)
<rbryce> thats an aws ip address
<nthomas> hmm, so we're sending too much data, or zeus isn't keeping up ?
<rbryce> I just pulled up the protection class, and it looks like that might be the case
<nthomas> only a single IP for hg.m.o, so only a single zeus in play right ?
<rbryce> I believe so
<nthomas> wonders if the b/w is hitting the limit for the s/w and nic
<rbryce> so we are only allowing 30 simultaneous connections per IP
<rbryce> with a max of 200 cons from the 10 busiest IP addresses
<nthomas> I'd be surprised if we were hitting that, should be one IP per AWS instance
<nthomas> unless hg is parallelising a lot

At 2034:
<rbryce> nthomas on a hunch, I found a failed node in the ftp pool. I just removed it
<nthomas> ftp2 ?
<rbryce> it is ftp8.dmz.scl3.mozilla.com.

I tried removing the IGW routing rule for hg.m.o from rtb-d77190b2 (testers in us-east-1), between 2017 and 2107.
Comment 4•10 years ago
And at 2107:
<rbryce> nthomas glandium as a test, I just added the aws ips I see in the zeus log to the exception list
Summary: Builds and test jobs failing with abort: error: _ssl.c:510: EOF occurred in violation of protocol among other failures → AWS builds and test jobs failing with abort: error: _ssl.c:510: EOF occurred in violation of protocol among other failures
Comment 5•10 years ago
There's still some backlog from buildbot-master78 and 79 in getting finished builds onto TBPL, but they are both try, which wasn't closed anyway. Reopened the rest of the trees.
Comment 6•10 years ago
OK, going through the Zeus logs: I caused this with an IP block that was too wide. I thought I was blocking traffic from Merck Pharma; I just put 2 and 2 together that a number of AWS IPs were in this range. I have fixed the error and have whitelisted the AWS IPs that were being blocked. My sincerest apologies all around for raising your pulse and closing the trees.
Comment 7•10 years ago
Great to know what the problem was. Do you think we might have resolved bug 1036176 at the same time? And did the block apply to just one site, or to all sites behind our scl3 zlbs?

Background: our AWS instances have 'public IPs' so they can reach hg.m.o/git.m.o/ftp-ssl.m.o over the internet, which takes a lot of load off the VPN tunnel we have to scl3. We have no control over which IP we get; AWS determines that at launch time. We have a record of the known IP blocks at http://hg.mozilla.org/build/cloud-tools/file/default/configs/routingtables.yml#l4 (we mainly use this to talk to S3; the instance IPs probably come from some subset of this).
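Checking whether a given instance's public IP falls inside one of those recorded blocks (and so could have been caught by an over-wide IP block like the one described in comment 6) only needs the standard library. A minimal sketch; the CIDR ranges below are made-up placeholders, not the actual contents of routingtables.yml:

```python
# Check whether an IP address falls inside any of a set of CIDR blocks.
# The blocks below are illustrative placeholders only.
import ipaddress

KNOWN_AWS_BLOCKS = [
    "54.191.0.0/16",  # placeholder range (covers the IP from the Zeus log)
    "54.80.0.0/13",   # placeholder range
]

def matching_block(ip, blocks=KNOWN_AWS_BLOCKS):
    """Return the first CIDR block containing `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for cidr in blocks:
        if addr in ipaddress.ip_network(cidr):
            return cidr
    return None

# The IP from the Zeus log line in comment 3:
print(matching_block("54.191.162.99"))  # -> 54.191.0.0/16
print(matching_block("203.0.113.7"))    # -> None
```

Running every instance IP through a check like this against a proposed block list before applying it would catch this class of outage in advance.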
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(rbryce)
Resolution: --- → FIXED
Comment 8•10 years ago
(In reply to Nick Thomas [:nthomas] from comment #7)
> Great to know what the problem was. Do you think we might have resolved bug
> 1036176 at the same time ? And did the block apply to just one site, or to
> all sites behind our scl3 zlbs ?

Just the hg site only.
Flags: needinfo?(rbryce)
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard