Closed Bug 705126 Opened 13 years ago Closed 13 years ago

intermittent network issues at scl1

Categories

(Infrastructure & Operations Graveyard :: NetOps, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: bhearsum, Assigned: ravi)

We had a bunch of jobs fail when some slaves hit network issues: disconnections from their master, connections dropped while uploading to surf, and DNS resolution failures.

Thu Nov 24 02:58:10 2011 - talos-r3-w7-034
Thu Nov 24 03:01:03 2011 - talos-r3-xp-006
Thu Nov 24 03:39:39 2011 - w64-ix-slave14
Thu Nov 24 03:49:06 2011 - linux-ix-slave17
Thu Nov 24 03:52:54 2011 - talos-r3-xp-021
Thu Nov 24 03:58:21 2011 - talos-r3-xp-032
Thu Nov 24 04:00:28 2011 - talos-r3-xp-050
Thu Nov 24 04:16:53 2011 - talos-r3-xp-039
Thu Nov 24 04:26:38 2011 - w64-ix-slave09
Thu Nov 24 04:26:43 2011 - w64-ix-slave15
Thu Nov 24 04:27:40 2011 - w64-ix-slave13
Thu Nov 24 04:38:25 2011 - talos-r3-xp-006
Thu Nov 24 04:58:34 2011 - talos-r3-xp-052
Thu Nov 24 05:37:48 2011 - talos-r3-xp-049

Given the spacing between the last two failures, I'm not sure whether we're out of the woods yet. Filing as critical: the tree is not closed because of this, but it is causing longer-than-normal turnaround times for some results.
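For anyone wanting to check the spacing between failures, here is a minimal sketch that parses the list above and prints the gap between consecutive entries. The helper names `parse` and `gaps` are made up for illustration, and the timestamps are taken as-is (the list does not state a time zone).

```python
from datetime import datetime

# Timestamps copied from the failure list above (time zone not stated there).
FAILURES = """\
Thu Nov 24 02:58:10 2011 - talos-r3-w7-034
Thu Nov 24 03:01:03 2011 - talos-r3-xp-006
Thu Nov 24 03:39:39 2011 - w64-ix-slave14
Thu Nov 24 03:49:06 2011 - linux-ix-slave17
Thu Nov 24 03:52:54 2011 - talos-r3-xp-021
Thu Nov 24 03:58:21 2011 - talos-r3-xp-032
Thu Nov 24 04:00:28 2011 - talos-r3-xp-050
Thu Nov 24 04:16:53 2011 - talos-r3-xp-039
Thu Nov 24 04:26:38 2011 - w64-ix-slave09
Thu Nov 24 04:26:43 2011 - w64-ix-slave15
Thu Nov 24 04:27:40 2011 - w64-ix-slave13
Thu Nov 24 04:38:25 2011 - talos-r3-xp-006
Thu Nov 24 04:58:34 2011 - talos-r3-xp-052
Thu Nov 24 05:37:48 2011 - talos-r3-xp-049
"""


def parse(text):
    """Turn 'timestamp - host' lines into (datetime, host) tuples."""
    events = []
    for line in text.strip().splitlines():
        stamp, host = line.rsplit(" - ", 1)
        events.append((datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"), host))
    return events


def gaps(events):
    """Return (timedelta, previous_host, next_host) for consecutive events."""
    ordered = sorted(events)
    return [(b[0] - a[0], a[1], b[1]) for a, b in zip(ordered, ordered[1:])]


if __name__ == "__main__":
    for delta, prev_host, next_host in gaps(parse(FAILURES)):
        print(f"{delta}  {prev_host} -> {next_host}")
```

The final gap (04:58:34 to 05:37:48) is the longest in the list, which is why it is hard to tell from this data alone whether the problem has stopped.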
I also noticed that a bunch of Buildbot masters in SCL1 lost their connection to the DB server in the same window:
2011-11-24 04:00:27-0800 [-] Unhandled Error
	Traceback (most recent call last):
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/master.py", line 153, in _get_processors
	    builders = sorter(self.parent, builders)
	  File "/builds/buildbot/tests1-windows/master/master_common.py", line 55, in prioritizeBuilders
	    (time.time() - 3600, botmaster.master_name, botmaster.master_incarnation))
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 182, in runQueryNow
	    return self.runInteractionNow(self._runQuery, *args, **kwargs)
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
	    return self._runInteractionNow(interaction, *args, **kwargs)
	--- <exception caught here> ---
	  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 244, in _runInteractionNow
	    conn.rollback()
	_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')

Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 03:44:58-0800 [-] Unhandled Error
	Traceback (most recent call last):
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/base.py", line 1165, in run
	    self.mainLoop()
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/base.py", line 1174, in mainLoop
	    self.runUntilCurrent()
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/base.py", line 796, in runUntilCurrent
	    call.func(*call.args, **call.kw)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/util/loop.py", line 146, in _loop_start
	    self._remaining = list(self.get_processors())
	--- <exception caught here> ---
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/master.py", line 153, in _get_processors
	    builders = sorter(self.parent, builders)
	  File "/builds/buildbot/build1/master/master_common.py", line 55, in prioritizeBuilders
	    requests = filter(lambda request: request[0] in allBuilderNames, requests)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 182, in runQueryNow
	    return self.runInteractionNow(self._runQuery, *args, **kwargs)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
	    return self._runInteractionNow(interaction, *args, **kwargs)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 234, in _runInteractionNow
	    conn = self.get_sync_connection()
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 228, in get_sync_connection
	    self._nonpool = self._spec.get_sync_connection()
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/dbspec.py", line 250, in get_sync_connection
	    conn = dbapi.connect(*self.connargs, **connkw)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/MySQLdb/__init__.py", line 81, in Connect
	    return Connection(*args, **kwargs)
	  File "/builds/buildbot/build1/lib/python2.6/site-packages/MySQLdb/connections.py", line 187, in __init__
	    super(Connection, self).__init__(*args, **kwargs2)
	_mysql_exceptions.OperationalError: (2003, "Can't connect to MySQL server on 'tm-b01-master01.mozilla.org' (113)")

...and a lot more like that.
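For context on the two error codes in those tracebacks: 2006 and 2003 are MySQL client-side errors (server gone away, and cannot connect to host), so both point at the connection rather than at the query. The sketch below shows one way a caller could treat them as transient and retry. It is not how Buildbot 0.8.2 handles these errors; `run_with_retry` and `is_transient` are hypothetical helpers, and `OperationalError` here is a local stand-in for `_mysql_exceptions.OperationalError` so the example runs without MySQLdb installed.

```python
import time

# MySQL client error codes seen in the tracebacks above.
CR_CONN_HOST_ERROR = 2003    # "Can't connect to MySQL server on ..."
CR_SERVER_GONE_ERROR = 2006  # "MySQL server has gone away"
TRANSIENT = {CR_CONN_HOST_ERROR, CR_SERVER_GONE_ERROR}


class OperationalError(Exception):
    """Stand-in for _mysql_exceptions.OperationalError; args = (code, message)."""


def is_transient(exc):
    """True if the error code suggests a connection problem worth retrying."""
    return bool(exc.args) and exc.args[0] in TRANSIENT


def run_with_retry(query, attempts=3, delay=1.0, sleep=time.sleep):
    """Call query(); on a transient error, back off exponentially and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return query()
        except OperationalError as exc:
            # Give up on non-connection errors or once attempts are exhausted.
            if not is_transient(exc) or attempt == attempts:
                raise
            sleep(delay * 2 ** (attempt - 1))
```

Retrying only masks short blips; with outages lasting several minutes, as here, the underlying network problem still needs fixing.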
Assignee: server-ops-releng → network-operations
Component: Server Operations: RelEng → Server Operations: Netops
QA Contact: zandr → mrz
We just had ~25 Windows slaves fail to clone a repository with this error:
abort: error: getaddrinfo failed
Updating summary since we're still seeing the issues.
Summary: some scl slaves were dropping connections for awhile this morning → intermittent network issues at scl1
A bunch of DNS failures on a master, too:
The following exceptions (total 12) were detected on buildbot-master13.build.scl1.mozilla.com bm13-build1:

Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:29:10-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.

--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:29:10-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.

--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:29:42-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.

--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:29:42-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.

--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:30:14-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.

--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:30:14-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.

--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:06-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.

--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:06-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.

--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:35-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.

--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:35-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.

--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:35-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.

--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:35-0800 [-] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
talos-r3-xp-005: Thu Nov 24 09:58:37 2011
From the ThunderbirdTrunk builders (generally <something>.sj.mozillamessaging.com), we had a few of:

ssh: connect to host stage.mozilla.org port 22: Connection refused

Times on various builders:

- Thu Nov 24 08:44:37 2011
- Thu Nov 24 08:50:38 2011
- Thu Nov 24 09:06:10 2011
- Thu Nov 24 09:37:14 2011

We also had:

Connecting to stage.mozilla.org|63.245.208.158|:80... failed: Connection refused.

- Thu Nov 24 09:40:31 2011

They seem to have recovered for now.
Are things stable?
Assignee: network-operations → ravi
I haven't seen any failures since my last comment, so things seem to be stable for now.
Of course, now I see:
16:05 < nagios-sjc1> [86] buildbot-master14.build.scl1:MySQL connectivity is WARNING: Unknown MySQL server host tm-b01-master01.mozilla.org (1)
16:05 < nagios-sjc1> [89] buildbot-master13.build.scl1:MySQL connectivity is WARNING: Unknown MySQL server host tm-b01-master01.mozilla.org (1)

So we're still having intermittent issues, but in smaller bursts than previously.
I just had two try jobs fail around 13:13 with:
utils.talosError: "Graph server unreachable (5 attempts)\n(11004, 'getaddrinfo failed')"

https://tbpl.mozilla.org/php/getParsedLog.php?id=7573804&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=7573784&tree=Try&full=1
And about 20-30 other jobs across mozilla-central and mozilla-inbound failed due to DNS issues, too.
What is the configured resolver for the hosts having issues?
(In reply to Ravi Pina [:ravi] from comment #12)
> What is the configured resolver for the hosts having issues?

talos-r3-xp-005 is configured as such:
C:\Documents and Settings\cltbld>nslookup
Default Server:  ns1.infra.scl1.mozilla.com
Address:  10.12.75.10

I assume the others are configured the same way.
And here's more detailed network configuration info:
        Connection-specific DNS Suffix  . : build.scl1.mozilla.com
        Description . . . . . . . . . . . : NVIDIA nForce 10/100/1000 Mbps Ethernet #2
        Physical Address. . . . . . . . . : D4-9A-20-BC-CF-A4
        Dhcp Enabled. . . . . . . . . . . : Yes
        Autoconfiguration Enabled . . . . : Yes
        IP Address. . . . . . . . . . . . : 10.12.50.113
        Subnet Mask . . . . . . . . . . . : 255.255.248.0
        Default Gateway . . . . . . . . . : 10.12.48.1
        DHCP Server . . . . . . . . . . . : 10.12.75.10
        DNS Servers . . . . . . . . . . . : 10.12.75.10
                                            10.12.75.12
        Lease Obtained. . . . . . . . . . : Thursday, November 24, 2011 2:11:31 PM
        Lease Expires . . . . . . . . . . : Friday, November 25, 2011 2:11:31 PM
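To tell whether one of the two resolvers is misbehaving, each can be queried directly instead of going through the OS resolver. The following is a rough sketch using only the standard library: it hand-builds a DNS A-record query and sends it over UDP to each server listed above. The function names `build_query`, `probe`, and `main` are illustrative, the two-second timeout is an arbitrary choice, and the lookup name is taken from the `mail.build.mozilla.org` failures earlier in this bug.

```python
import socket
import struct

RESOLVERS = ["10.12.75.10", "10.12.75.12"]  # from the ipconfig output above


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record of `name`."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def probe(server, name, timeout=2.0):
    """Return the response code from `server`, or None on timeout/socket error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            return None
    if len(reply) < 4:
        return None
    return reply[3] & 0x0F  # RCODE is the low four bits of the fourth byte


def main(name="mail.build.mozilla.org"):
    for server in RESOLVERS:
        print(server, probe(server, name))
```

Calling `main()` in a loop from an affected slave would show whether failures line up with one server timing out (`None`) or with both. A return value of 0 means the server answered without error.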
From reading nagios alerts, here is more data which may help with debugging:

1) between 03:36-03:52 PST today, we saw connectivity problems with:
switch1.r101-{1,2,3,4,5,7,8,9,11,12,13,14}.ops.scl1
switch1.r102-{2,3,4}.ops.scl1
...which seemed to all (afaict) recover between 03:52-03:55 PST



2) between 13:04-13:15 PST today, we saw connectivity problems with:
scl-production-puppet.build.scl1
buildbot-master{04,06,11,12,13,14,15,16,17,18}.build.scl1
hg{1,2}.build.scl1
dev-master01.build.scl1
...which seemed to all (afaict) recover between 13:15-13:31 PST
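As a quick cross-check, the reported failures can be bucketed against these two nagios windows. This is only a sketch under a loud assumption: the slave failure times in the original report are treated as PST, which that list does not actually state. Each window is widened to include its recovery period (03:36-03:55 and 13:04-13:31), the 13:13 entry is the approximate time of the try job failures mentioned earlier, and `classify` is a made-up helper name.

```python
from datetime import time

# Problem start through end of recovery, per the nagios summary above (PST).
WINDOWS = [
    ("switch1.r101/r102", time(3, 36), time(3, 55)),
    ("masters, hg, puppet", time(13, 4), time(13, 31)),
]

# ASSUMPTION: slave failure times from the original report are PST.
EVENTS = """\
02:58:10 talos-r3-w7-034
03:01:03 talos-r3-xp-006
03:39:39 w64-ix-slave14
03:49:06 linux-ix-slave17
03:52:54 talos-r3-xp-021
03:58:21 talos-r3-xp-032
04:00:28 talos-r3-xp-050
04:16:53 talos-r3-xp-039
04:26:38 w64-ix-slave09
04:26:43 w64-ix-slave15
04:27:40 w64-ix-slave13
04:38:25 talos-r3-xp-006
04:58:34 talos-r3-xp-052
05:37:48 talos-r3-xp-049
13:13:00 try jobs (getaddrinfo failed, approximate time)
"""


def classify(events_text, windows=WINDOWS):
    """Group event labels by the window they fall in, or 'outside'."""
    result = {name: [] for name, _, _ in windows}
    result["outside"] = []
    for line in events_text.strip().splitlines():
        stamp, label = line.split(" ", 1)
        hour, minute, second = map(int, stamp.split(":"))
        when = time(hour, minute, second)
        for name, start, end in windows:
            if start <= when <= end:
                result[name].append(label)
                break
        else:
            result["outside"].append(label)
    return result


if __name__ == "__main__":
    for bucket, labels in classify(EVENTS).items():
        print(f"{bucket}: {len(labels)}")
```

If the PST assumption holds, only three of the fourteen slave failures land inside the first window, so the nagios alerts may not capture the full extent of the problem.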
I haven't seen any failures since the ones around 1pm PST yesterday. Downgrading severity.
Severity: critical → major
There is a known issue with the version of JUNOS we're running in SCL1.  Bug 705823 is open to upgrade to a version which has the fix (among others).
Blocks: 705823
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard