Closed Bug 790578 Opened 12 years ago Closed 12 years ago

Android/Linux Nightlies failing in upload symbols with "ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out"

Categories

(Infrastructure & Operations Graveyard :: NetOps, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: cransom)

Details

Retriggering didn't help; is affecting both armv7 and armv6 builds.

All affected builds so far were on EC2.

{
Transferring symbols... ./dist/fennec-18.0a1.en-US.android-arm.crashreporter-symbols-full.zip
debug1: Reading configuration data /home/mock_mozilla/.ssh/config
debug1: Applying options for *mozilla.com
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Connecting to symbols1.dmz.phx1.mozilla.com [10.8.74.48] port 22.
debug1: connect to address 10.8.74.48 port 22: Connection timed out
ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out
lost connection
make: *** [uploadsymbols] Error 1
}
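For anyone retriaging this later: the failure in the log above is a plain TCP timeout, not an ssh-level error, so a raw connect probe is enough to separate network reachability from auth/config problems. A minimal sketch, assuming bash (for `/dev/tcp`) and coreutils `timeout`; the host and port are taken from the log above.

```shell
#!/usr/bin/env bash
# probe: report whether a TCP connect to host:port succeeds within 5 seconds.
# A timeout here matches the "Connection timed out" failure mode above;
# an ssh/auth problem would instead connect fine at this layer.
probe() {
  local host=$1 port=$2
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

# Host and port from the failing log; run this from an affected EC2 slave.
probe symbols1.dmz.phx1.mozilla.com 22
```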

bld-linux64-ec2-025
https://tbpl.mozilla.org/php/getParsedLog.php?id=15155501&tree=Firefox

bld-linux64-ec2-022
https://tbpl.mozilla.org/php/getParsedLog.php?id=15156109&tree=Firefox

bld-linux64-ec2-021
https://tbpl.mozilla.org/php/getParsedLog.php?id=15156209&tree=Firefox
Another retrigger, on a non-EC2 slave, was green:
s: bld-centos6-hp-017
https://tbpl.mozilla.org/php/getParsedLog.php?id=15157490&tree=Firefox
Bah.
Flipping this over to IT - it appears we're having problems getting to symbols1.dmz.phx1.mozilla.com from build machines in AWS, which are in the 10.130.0.0/16 network connected by VPN via scl3.

It looks to be intermittent - I can't reproduce from those machines any more.
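Since the failure can't be reproduced on demand, a timestamped probe loop left running on one of the affected slaves would show whether the timeouts cluster in a window. A sketch, assuming bash and coreutils `timeout`; host and port as in the failing log.

```shell
#!/usr/bin/env bash
# check_once: probe host:port and print a timestamped ok/FAIL line, so an
# intermittent outage window shows up in the log even when no one is watching.
check_once() {
  local host=$1 port=$2 status
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    status=ok
  else
    status=FAIL
  fi
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ${status}"
}

# Leave this running on an affected slave; grep probe.log for FAIL later.
while true; do
  check_once symbols1.dmz.phx1.mozilla.com 22 >> probe.log
  sleep 60
done
```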
Assignee: nobody → server-ops
Severity: blocker → normal
Component: Release Engineering → Server Operations
QA Contact: jdow
Can someone tell me a machine in AWS that had issues connecting to symbols1.dmz.phx1?
I'd like to look at some logs.
(In reply to Ed Morley [:edmorley UTC+1] from comment #0)
> bld-linux64-ec2-025
...
> bld-linux64-ec2-022
... 
> bld-linux64-ec2-021
bld-linux64-ec2-021 10.130.133.21
bld-linux64-ec2-025 10.130.83.174
bld-linux64-ec2-022 10.130.206.18
s: bld-linux64-ec2-008
OS: Android → All
Hardware: ARM → All
Summary: mozilla-central Android Nightlies failing in upload symbols with "ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out" → mozilla-central Android/Linux Nightlies failing in upload symbols with "ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out"
s: bld-linux64-ec2-062
Summary: mozilla-central Android/Linux Nightlies failing in upload symbols with "ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out" → Android/Linux Nightlies failing in upload symbols with "ssh: connect to host symbols1.dmz.phx1.mozilla.com port 22: Connection timed out"
Is this always happening for the above servers? Or does it work sometimes?
Do all the Amazon servers have this issue, or only some?
If it's all the Amazon slaves, then it broke at some instant this morning, since https://tbpl.mozilla.org/php/getParsedLog.php?id=15437930&tree=Firefox is a successful Android nightly on an Amazon slave. fx-team starts its nightlies at 04:00, and aurora starts its at 04:20, and although mozilla-central, where most of them succeeded, starts its at 03:00, the one which failed there took forever and a day, and didn't finish until after 05:00. So "it's all of them, and it broke sometime between 04:00 and 05:00 today" is a reasonable guess to start from.
Probably we need to allow the 10.130.0.0/16 to talk to symbols1.dmz.phx1, like bug 784521 did for aus3-staging.
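As a sanity check on the firewall-rule theory, one can confirm that the slave addresses quoted earlier in the bug really fall inside 10.130.0.0/16; for a /16 the check reduces to comparing the first two octets. A throwaway sketch, assuming bash:

```shell
#!/usr/bin/env bash
# Does an IPv4 address fall inside 10.130.0.0/16? For a /16 boundary this
# is just a comparison of the first two octets.
in_10_130_16() {
  local a b rest
  IFS=. read -r a b rest <<< "$1"
  [ "$a" = 10 ] && [ "$b" = 130 ]
}

# Slave IPs from the comments above, plus the symbols host itself.
for ip in 10.130.62.17 10.130.83.174 10.8.74.48; do
  if in_10_130_16 "$ip"; then
    echo "$ip: inside 10.130.0.0/16"
  else
    echo "$ip: outside 10.130.0.0/16"
  fi
done
```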
Actually, that firewall rule should be in place already, otherwise we'd have been hitting issues with Android nightlies run on EC2 earlier than this week's desktop build switch.

AFAICT all the slaves are getting IP addresses beginning 10.130, as configured. In particular bld-linux64-ec2-030 has had 10.130.62.17 since Sep 21 22:13:55, but failed to reach the symbols host on the morning of the 22nd (comment #12). I can manually make connections to symbols1.dmz.phx1.mozilla.com and aus3-staging.mozilla.org with no problems.

Netops, do we have any logging where the tunnel from Amazon terminates? Could we be having some intermittent issues over the link?
Assignee: server-ops → network-operations
Component: Server Operations → Server Operations: Netops
QA Contact: jdow → ravi
If the AWS <-> scl3 link had intermittent issues, we'd see busted builds and no ability to log into the machines. I've been able to log into the machines while also being unable to ssh from the AWS machines to the symbol server.
Assignee: network-operations → server-ops
Component: Server Operations: Netops → Server Operations
QA Contact: ravi → jdow
sorry, didn't mean to move this out of netops
Assignee: server-ops → network-operations
Component: Server Operations → Server Operations: Netops
QA Contact: jdow → ravi
I'm not sure if there is still a problem or not.  The data path looks to be fine.  Without a host in the AWS environment to test with I can't do much.
It seems to happen each night. Perhaps there's some kind of limit on the number of connections from AWS to PHX that comes into play?
Well, no, I'd say it's more the case that it happens on nights when the date ends with the digit "2" - it happened like crazy on the 12th, and again on the 22nd (and yeah, I'll be amazed if that really does turn out to be the case and the next instance is either the 2nd or 12th of October).
October 2nd came and went.

Is this still an issue?
Assignee: network-operations → cransom
We had bug 797805 on the 4th, but I think we're done here :-)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard