Closed Bug 790578 Opened 12 years ago Closed 12 years ago

Android/Linux Nightlies failing in upload symbols with "ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out"

Categories

(Infrastructure & Operations Graveyard :: NetOps, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: cransom)

Details

Retriggering didn't help; is affecting both armv7 and armv6 builds.

All affected builds so far were on EC2.

{
Transferring symbols... ./dist/fennec-18.0a1.en-US.android-arm.crashreporter-symbols-full.zip
debug1: Reading configuration data /home/mock_mozilla/.ssh/config
debug1: Applying options for *mozilla.com
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to symbols1.dmz.phx1.mozilla.com [10.8.74.48] port 22.
debug1: connect to address 10.8.74.48 port 22: Connection timed out
ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out
lost connection
make: *** [uploadsymbols] Error 1
}
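For anyone retriaging this later: the failure in the log above is a plain TCP timeout, not an ssh-level error, so a raw connect probe is enough to separate network reachability from auth/config problems. A minimal sketch, assuming bash (for `/dev/tcp`) and coreutils `timeout`; the host and port are taken from the log above.

```shell
#!/usr/bin/env bash
# probe: report whether a TCP connect to host:port succeeds within 5 seconds.
# A timeout here matches the "Connection timed out" failure mode above;
# an ssh/auth problem would instead connect fine at this layer.
probe() {
  local host=$1 port=$2
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

# Host and port from the failing log; run this from an affected EC2 slave.
probe symbols1.dmz.phx1.mozilla.com 22
```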

bld-linux64-ec2-025
https://tbpl.mozilla.org/php/getParsedLog.php?id=15155501&tree=Firefox

bld-linux64-ec2-022
https://tbpl.mozilla.org/php/getParsedLog.php?id=15156109&tree=Firefox

bld-linux64-ec2-021
https://tbpl.mozilla.org/php/getParsedLog.php?id=15156209&tree=Firefox
Another retrigger, on a non-EC2 slave, was green:
s: bld-centos6-hp-017
https://tbpl.mozilla.org/php/getParsedLog.php?id=15157490&tree=Firefox
Bah.
Flipping this over to IT - it appears we're having problems getting to symbols1.dmz.phx1.mozilla.com from build machines in AWS, which are in the 10.130.0.0/16 network connected by VPN via scl3.

It looks to be intermittent - I can't reproduce from those machines any more.
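Since the failure can't be reproduced on demand, a timestamped probe loop left running on one of the affected slaves would show whether the timeouts cluster in a window. A sketch, assuming bash and coreutils `timeout`; host and port as in the failing log.

```shell
#!/usr/bin/env bash
# check_once: probe host:port and print a timestamped ok/FAIL line, so an
# intermittent outage window shows up in the log even when no one is watching.
check_once() {
  local host=$1 port=$2 status
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    status=ok
  else
    status=FAIL
  fi
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ${status}"
}

# Leave this running on an affected slave; grep probe.log for FAIL later.
while true; do
  check_once symbols1.dmz.phx1.mozilla.com 22 >> probe.log
  sleep 60
done
```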
Assignee: nobody → server-ops
Severity: blocker → normal
Component: Release Engineering → Server Operations
QA Contact: jdow
Can someone tell me a machine in AWS that had issues connecting to symbols1.dmz.phx1?
I'd like to look at some logs.
(In reply to Ed Morley [:edmorley UTC+1] from comment #0)
> bld-linux64-ec2-025
...
> bld-linux64-ec2-022
... 
> bld-linux64-ec2-021
bld-linux64-ec2-021 10.130.133.21
bld-linux64-ec2-025 10.130.83.174
bld-linux64-ec2-022 10.130.206.18
s: bld-linux64-ec2-008
OS: Android → All
Hardware: ARM → All
Summary: mozilla-central Android Nightlies failing in upload symbols with "ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out" → mozilla-central Android/Linux Nightlies failing in upload symbols with "ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out"
s: bld-linux64-ec2-062
Summary: mozilla-central Android/Linux Nightlies failing in upload symbols with "ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out" → Android/Linux Nightlies failing in upload symbols with "ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out"
Is this always happening for the above servers? Or does it work sometimes?
Do all the Amazon servers have this issue, or only some?
If it's all the Amazon slaves, then it broke at some instant this morning, since https://tbpl.mozilla.org/php/getParsedLog.php?id=15437930&tree=Firefox is a successful Android nightly on an Amazon slave. fx-team starts its nightlies at 04:00, and aurora starts its at 04:20, and although mozilla-central, where most of them succeeded, starts its at 03:00, the one which failed there took forever and a day, and didn't finish until after 05:00. So "it's all of them, and it broke sometime between 04:00 and 05:00 today" is a reasonable guess to start from.
Probably we need to allow the 10.130.0.0/16 to talk to symbols1.dmz.phx1, like bug 784521 did for aus3-staging.
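As a sanity check on the firewall-rule theory, one can confirm that the slave addresses quoted earlier in the bug really fall inside 10.130.0.0/16; for a /16 the check reduces to comparing the first two octets. A throwaway sketch, assuming bash:

```shell
#!/usr/bin/env bash
# Does an IPv4 address fall inside 10.130.0.0/16? For a /16 boundary this
# is just a comparison of the first two octets.
in_10_130_16() {
  local a b rest
  IFS=. read -r a b rest <<< "$1"
  [ "$a" = 10 ] && [ "$b" = 130 ]
}

# Slave IPs from the comments above, plus the symbols host itself.
for ip in 10.130.62.17 10.130.83.174 10.8.74.48; do
  if in_10_130_16 "$ip"; then
    echo "$ip: inside 10.130.0.0/16"
  else
    echo "$ip: outside 10.130.0.0/16"
  fi
done
```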
Actually, that firewall rule should be in place already, otherwise we'd have been hitting issues with Android nightlies run on EC2 earlier than this week's desktop build switch.

AFAICT all the slaves are getting IP addresses beginning 10.130, as configured. In particular bld-linux64-ec2-030 has had 10.130.62.17 since Sep 21 22:13:55, but failed to reach the symbols host on the morning of the 22nd (comment #12). I can manually make connections to symbols1.dmz.phx1.mozilla.com and aus3-staging.mozilla.org with no problems.

Netops, do we have any logging where the tunnel from Amazon terminates? Could we be having some intermittent issues over the link?
Assignee: server-ops → network-operations
Component: Server Operations → Server Operations: Netops
QA Contact: jdow → ravi
If the AWS <-> scl3 link had intermittent issues, we'd see busted builds and no ability to log into the machines. I've been able to log into the machines while also being unable to ssh from the AWS machines to the symbol server.
Assignee: network-operations → server-ops
Component: Server Operations: Netops → Server Operations
QA Contact: ravi → jdow
sorry, didn't mean to move this out of netops
Assignee: server-ops → network-operations
Component: Server Operations → Server Operations: Netops
QA Contact: jdow → ravi
I'm not sure if there is still a problem or not.  The data path looks to be fine.  Without a host in the AWS environment to test with I can't do much.
It seems to happen each night. Perhaps there's some kind of limit on the number of connections from AWS to PHX that comes into play?
Well, no, I'd say it's more the case that it happens on nights when the date ends with the digit "2" - it happened like crazy on the 12th, and again on the 22nd (and yeah, I'll be amazed if that really does turn out to be the case and the next instance is either the 2nd or 12th of October).
October 2nd came and went.

Is this still an issue?
Assignee: network-operations → cransom
We had bug 797805 on the 4th, but I think we're done here :-)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard