Bug 705126
Opened 13 years ago
Closed 13 years ago
intermittent network issues at scl1
Categories
(Infrastructure & Operations Graveyard :: NetOps, task)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: bhearsum, Assigned: ravi)
We had a bunch of jobs fail when some slaves hit network issues: disconnecting from their master, connections dropped while uploading to surf, and DNS resolution failures.
Thu Nov 24 02:58:10 2011 - talos-r3-w7-034
Thu Nov 24 03:01:03 2011 - talos-r3-xp-006
Thu Nov 24 03:39:39 2011 - w64-ix-slave14
Thu Nov 24 03:49:06 2011 - linux-ix-slave17
Thu Nov 24 03:52:54 2011 - talos-r3-xp-021
Thu Nov 24 03:58:21 2011 - talos-r3-xp-032
Thu Nov 24 04:00:28 2011 - talos-r3-xp-050
Thu Nov 24 04:16:53 2011 - talos-r3-xp-039
Thu Nov 24 04:26:38 2011 - w64-ix-slave09
Thu Nov 24 04:26:43 2011 - w64-ix-slave15
Thu Nov 24 04:27:40 2011 - w64-ix-slave13
Thu Nov 24 04:38:25 2011 - talos-r3-xp-006
Thu Nov 24 04:58:34 2011 - talos-r3-xp-052
Thu Nov 24 05:37:48 2011 - talos-r3-xp-049
Given the spacing between the last two failures, I'm not sure whether we're out of the woods yet. Filing as critical: the tree is not closed because of this, but it is causing longer-than-normal turnaround time for some results.
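For anyone who wants to check the spacing between failures rather than eyeball it, here is a minimal sketch. It only uses the timestamps listed in the description (host names dropped); the function name `gaps_in_minutes` is made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the failure list in the description.
FAILURES = [
    "Thu Nov 24 02:58:10 2011",
    "Thu Nov 24 03:01:03 2011",
    "Thu Nov 24 03:39:39 2011",
    "Thu Nov 24 03:49:06 2011",
    "Thu Nov 24 03:52:54 2011",
    "Thu Nov 24 03:58:21 2011",
    "Thu Nov 24 04:00:28 2011",
    "Thu Nov 24 04:16:53 2011",
    "Thu Nov 24 04:26:38 2011",
    "Thu Nov 24 04:26:43 2011",
    "Thu Nov 24 04:27:40 2011",
    "Thu Nov 24 04:38:25 2011",
    "Thu Nov 24 04:58:34 2011",
    "Thu Nov 24 05:37:48 2011",
]


def gaps_in_minutes(stamps):
    """Return the gap, in minutes, between each consecutive pair of timestamps."""
    times = sorted(datetime.strptime(s, "%a %b %d %H:%M:%S %Y") for s in stamps)
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]


gaps = gaps_in_minutes(FAILURES)
print("gap before the last failure: %.1f min" % gaps[-1])
print("longest gap overall:         %.1f min" % max(gaps))
```

The gap before the last failure is also the longest one in the list, which is why it is hard to tell from this data alone whether the problem has stopped.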
Reporter
Comment 1•13 years ago
I also noticed that a bunch of Buildbot masters in SCL1 lost their connection to the DB server in the same window:
2011-11-24 04:00:27-0800 [-] Unhandled Error
Traceback (most recent call last):
  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/master.py", line 153, in _get_processors
    builders = sorter(self.parent, builders)
  File "/builds/buildbot/tests1-windows/master/master_common.py", line 55, in prioritizeBuilders
    (time.time() - 3600, botmaster.master_name, botmaster.master_incarnation))
  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 182, in runQueryNow
    return self.runInteractionNow(self._runQuery, *args, **kwargs)
  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
    return self._runInteractionNow(interaction, *args, **kwargs)
--- <exception caught here> ---
  File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 244, in _runInteractionNow
    conn.rollback()
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 03:44:58-0800 [-] Unhandled Error
Traceback (most recent call last):
  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/base.py", line 1165, in run
    self.mainLoop()
  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/base.py", line 1174, in mainLoop
    self.runUntilCurrent()
  File "/builds/buildbot/build1/lib/python2.6/site-packages/twisted/internet/base.py", line 796, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/util/loop.py", line 146, in _loop_start
    self._remaining = list(self.get_processors())
--- <exception caught here> ---
  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/master.py", line 153, in _get_processors
    builders = sorter(self.parent, builders)
  File "/builds/buildbot/build1/master/master_common.py", line 55, in prioritizeBuilders
    requests = filter(lambda request: request[0] in allBuilderNames, requests)
  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 182, in runQueryNow
    return self.runInteractionNow(self._runQuery, *args, **kwargs)
  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 212, in runInteractionNow
    return self._runInteractionNow(interaction, *args, **kwargs)
  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 234, in _runInteractionNow
    conn = self.get_sync_connection()
  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/connector.py", line 228, in get_sync_connection
    self._nonpool = self._spec.get_sync_connection()
  File "/builds/buildbot/build1/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/db/dbspec.py", line 250, in get_sync_connection
    conn = dbapi.connect(*self.connargs, **connkw)
  File "/builds/buildbot/build1/lib/python2.6/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/builds/buildbot/build1/lib/python2.6/site-packages/MySQLdb/connections.py", line 187, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (2003, "Can't connect to MySQL server on 'tm-b01-master01.mozilla.org' (113)")
...and a lot more like that.
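Both tracebacks end in an `OperationalError` whose first argument is a MySQL client error code (2006 and 2003). As a rough illustration of how a caller could ride out short outages like these, here is a generic retry-with-backoff sketch. This is not what Buildbot 0.8.2 does; `call_with_retry` and `is_transient_mysql_error` are hypothetical helpers, and the assumption that the code is in `exc.args[0]` is taken from the shape of the errors shown in the tracebacks.

```python
import time

# MySQL client error codes seen in the tracebacks:
# 2006 = "MySQL server has gone away", 2003 = "Can't connect to MySQL server".
TRANSIENT_MYSQL_CODES = {2003, 2006}


def is_transient_mysql_error(exc):
    """Hypothetical classifier: treat the first exception argument as the error code."""
    return bool(exc.args) and exc.args[0] in TRANSIENT_MYSQL_CODES


def call_with_retry(operation, is_transient, attempts=5, base_delay=1.0,
                    sleep=time.sleep):
    """Call operation(); on a transient error, wait (doubling each time) and retry.

    Non-transient errors, and the error from the final attempt, are re-raised.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc) or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is only there so the delay can be stubbed out when trying the helper without actually waiting.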
Updated•13 years ago
Assignee: server-ops-releng → network-operations
Component: Server Operations: RelEng → Server Operations: Netops
QA Contact: zandr → mrz
Reporter
Comment 2•13 years ago
We just had ~25 Windows slaves fail to clone a repository with this error:
abort: error: getaddrinfo failed
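`getaddrinfo failed` means the name lookup itself failed before any connection was attempted. A small probe like the one below could be run on an affected slave to see which names fail to resolve. It is a sketch only: `probe` is a made-up helper, and the `resolve` parameter exists so the lookup can be replaced with a stub.

```python
import socket


def probe(hostnames, resolve=socket.getaddrinfo):
    """Try to resolve each name; return {hostname: error string, or None on success}."""
    results = {}
    for name in hostnames:
        try:
            resolve(name, None)
            results[name] = None
        except socket.gaierror as exc:
            results[name] = str(exc)
    return results
```

Running it in a loop with a timestamp per pass would show whether lookups fail in bursts, which is what the failures in this bug suggest.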
Reporter
Comment 3•13 years ago
Updating summary since we're still seeing the issues.
Summary: some scl slaves were dropping connections for awhile this morning → intermittent network issues at scl1
Reporter
Comment 4•13 years ago
A bunch of DNS failures on a master, too:
The following exceptions (total 12) were detected on buildbot-master13.build.scl1.mozilla.com bm13-build1:
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:29:10-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:29:10-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:29:42-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:29:42-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:30:14-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:30:14-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:06-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:06-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:35-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:35-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:35-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
--------------------------------------------------------------------------------
Exception in /builds/buildbot/build1/master/twistd.log:
2011-11-24 07:32:35-0800 [-] Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
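With this many identical entries it is easier to see the bursts by counting entries per minute. A minimal sketch, assuming the `twistd.log` entries start with the timestamp format shown in this comment; `count_errors_per_minute` is a made-up name and the sample text is a shortened stand-in for the real log.

```python
import re
from collections import Counter

# Matches the timestamp that opens each "Unhandled Error" entry, e.g.
# "2011-11-24 07:29:10-0800 [-] Unhandled Error".
ENTRY_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})[-+]\d{4} \[-\] Unhandled Error"
)


def count_errors_per_minute(log_text):
    """Return a Counter mapping 'YYYY-MM-DD HH:MM' to the number of entries."""
    return Counter(stamp[:16] for stamp in ENTRY_RE.findall(log_text))


# Shortened stand-in for the log excerpt.
sample = (
    "2011-11-24 07:29:10-0800 [-] Unhandled Error\n"
    "2011-11-24 07:29:42-0800 [-] Unhandled Error\n"
    "2011-11-24 07:32:06-0800 [-] Unhandled Error\n"
)
for minute, count in sorted(count_errors_per_minute(sample).items()):
    print(minute, count)
```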
Reporter
Comment 5•13 years ago
talos-r3-xp-005: Thu Nov 24 09:58:37 2011
Comment 6•13 years ago
From the ThunderbirdTrunk builders (generally <something>.sj.mozillamessaging.com), we had a few of:
ssh: connect to host stage.mozilla.org port 22: Connection refused
Times on various builders:
- Thu Nov 24 08:44:37 2011
- Thu Nov 24 08:50:38 2011
- Thu Nov 24 09:06:10 2011
- Thu Nov 24 09:37:14 2011
We also had:
Connecting to stage.mozilla.org|63.245.208.158|:80... failed: Connection refused.
- Thu Nov 24 09:40:31 2011
Indications are that they have recovered for the moment.
Reporter
Comment 8•13 years ago
I haven't seen any failures since my last comment, so the issues seem to be over for now.
Reporter
Comment 9•13 years ago
Of course, now I see:
16:05 < nagios-sjc1> [86] buildbot-master14.build.scl1:MySQL connectivity is WARNING: Unknown MySQL server host tm-b01-master01.mozilla.org (1)
16:05 < nagios-sjc1> [89] buildbot-master13.build.scl1:MySQL connectivity is WARNING: Unknown MySQL server host tm-b01-master01.mozilla.org (1)
So we're still having intermittent issues, but in smaller bursts than previously.
Comment 10•13 years ago
I just had two try jobs fail around 13:13 with:
utils.talosError: "Graph server unreachable (5 attempts)\n(11004, 'getaddrinfo failed')"
https://tbpl.mozilla.org/php/getParsedLog.php?id=7573804&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=7573784&tree=Try&full=1
Reporter
Comment 11•13 years ago
And about 20-30 other jobs across mozilla-central and mozilla-inbound failed due to DNS issues, too.
Assignee
Comment 12•13 years ago
What is the configured resolver for the hosts having issues?
Reporter
Comment 13•13 years ago
(In reply to Ravi Pina [:ravi] from comment #12)
> What is the configured resolver for the hosts having issues?
talos-r3-xp-005 is configured as such:
C:\Documents and Settings\cltbld>nslookup
Default Server: ns1.infra.scl1.mozilla.com
Address: 10.12.75.10
I assume the others are as well.
Reporter
Comment 14•13 years ago
And here's more detailed network configuration info:
Connection-specific DNS Suffix . : build.scl1.mozilla.com
Description . . . . . . . . . . . : NVIDIA nForce 10/100/1000 Mbps Ethernet #2
Physical Address. . . . . . . . . : D4-9A-20-BC-CF-A4
Dhcp Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes
IP Address. . . . . . . . . . . . : 10.12.50.113
Subnet Mask . . . . . . . . . . . : 255.255.248.0
Default Gateway . . . . . . . . . : 10.12.48.1
DHCP Server . . . . . . . . . . . : 10.12.75.10
DNS Servers . . . . . . . . . . . : 10.12.75.10
                                    10.12.75.12
Lease Obtained. . . . . . . . . . : Thursday, November 24, 2011 2:11:31 PM
Lease Expires . . . . . . . . . . : Friday, November 25, 2011 2:11:31 PM
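One thing the configuration above implies is that the DNS servers are not on the slave's own subnet, so every lookup has to cross the default gateway. A quick check with Python's standard `ipaddress` module, using only the addresses from the ipconfig output (variable names are just for illustration):

```python
import ipaddress

# Values copied from the ipconfig output in this comment.
host = ipaddress.ip_interface("10.12.50.113/255.255.248.0")
gateway = ipaddress.ip_address("10.12.48.1")
dns_servers = [
    ipaddress.ip_address("10.12.75.10"),
    ipaddress.ip_address("10.12.75.12"),
]

# The slave's subnet, derived from its address and mask.
print("subnet:", host.network)
print("gateway on subnet:", gateway in host.network)

# DNS servers outside the slave's subnet must be reached through the gateway.
off_link = [str(server) for server in dns_servers if server not in host.network]
print("DNS servers reached via the gateway:", off_link)
```

The slave sits in 10.12.48.0/21, while both DNS servers are in 10.12.75.x, so a routing or switching problem between those networks would show up as DNS failures on the slaves even if the DNS servers themselves were healthy.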
Comment 15•13 years ago
From reading nagios alerts, here is more data which may help with debugging:
1) between 03:36-03:52 PST today, we saw connectivity problems with:
switch1.r101-{1,2,3,4,5,7,8,9,11,12,13,14}.ops.scl1
switch1.r102-{2,3,4}.ops.scl1
...which seemed to all (afaict) recover between 03:52-03:55 PST
2) between 13:04-13:15 PST today, we saw connectivity problems with:
scl-production-puppet.build.scl1
buildbot-master{04,06,11,12,13,14,15,16,17,18}.build.scl1
hg{1,2}.build.scl1
dev-master01.build.scl1
...which seemed to all (afaict) recover between 13:15-13:31 PST
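The host lists above use shell-style brace shorthand. If the full host names are needed (for example, to grep logs per host), a small helper can expand them. This is a sketch; `expand_braces` is a made-up name and only handles simple, non-nested `{a,b,c}` groups like the ones used in this comment.

```python
import re


def expand_braces(pattern):
    """Expand every non-nested {a,b,c} group in pattern into a list of strings."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if not match:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    results = []
    for item in match.group(1).split(","):
        # Recurse so that any later groups in the tail are expanded too.
        results.extend(expand_braces(head + item + tail))
    return results


for name in expand_braces("switch1.r102-{2,3,4}.ops.scl1"):
    print(name)
```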
Reporter
Comment 16•13 years ago
I haven't seen any failures since the ones around 1pm PST yesterday. Downgrading severity.
Severity: critical → major
Assignee
Comment 17•13 years ago
There is a known issue with the version of JUNOS we're running in SCL1. Bug 705823 is open to upgrade to a version which has the fix (among others).
Updated•11 years ago
Product: mozilla.org → Infrastructure & Operations
Updated•2 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard