1445580 - issue on OSX machines - possible network connectivity issue

Reporter

Description

•

7 years ago

I have been reaching out via IRC. There was a discussion on #taskcluster; for scrollback, see https://mozilla.logbot.info/taskcluster/20180314

07:29:35 <pmoore> morning all
08:01:09 <dluca|sheriffduty> !t-rex: Hello, I'm seeing failure logs on OS X finishing showing Unnamed step, a lots of them, on both inbound and autoland. Can it be machine related ?
08:09:42 <pmoore> dluca|sheriffduty: is the workerId always the same?
08:11:23 <dluca|sheriffduty> pmoore: Nope, IDs are different
08:12:19 <dluca|sheriffduty> pmoore: Is is the same machine type or test type : gecko-t-osx-1010
08:13:33 <pmoore> dluca|sheriffduty: this worker type is unfortunately on a slightly older version of the worker that makes it difficult for treeherder to parse the error messages, but looking at one of them i see
08:13:36 <pmoore> Aborting task - max run time exceeded!
08:13:51 <pmoore> i'll see if this is a common pattern....
08:14:45 <dluca|sheriffduty> I can see the same thing
08:15:10 <pmoore> this looks like some network slowness: https://treeherder.mozilla.org/logviewer.html#?job_id=167866941&repo=mozilla-inbound&lineNumber=413-427
08:15:41 <pmoore> i suspect problems with network connectivity - these are macs running in our data center ....
08:16:02 <pmoore> grenade: ^
08:16:30 <pmoore> dluca|sheriffduty: i'm not sure who can support these types of issues at this time
08:16:43 <pmoore> might be worth raising in #moc ?
08:17:15 <dluca|sheriffduty> pmoore: Ok, thank you for looking into it!
08:17:24 <pmoore> yw :)
08:19:42 <pmoore> dluca|sheriffduty: it might be worth retrying some of the failed jobs, in case the network slowness was intermittent
08:20:20 <dluca|sheriffduty> pmoore: Was going to ask about that

https://treeherder.mozilla.org/logviewer.html#?job_id=167866941&repo=mozilla-inbound&lineNumber=413-427

The log says the task was aborted because it hit the max run time. Frankly, I'm far from being able to identify the issue.

Relops, could you please take a look at this? Thank you very much.

Justin Lazaro [:jlaz] (use needinfo)

Comment 1

•

7 years ago

Attached file MTR from nagios1.private.releng.mdc1 to tools.taskcluster.net — Details

I've attached an MTR generated from nagios1.private.releng.mdc1 to tools.taskcluster.net. It looks like the latency increase is due to routing through Japan and Taiwan. Is this intended?

Pete Moore [:pmoore][:pete]

Updated

•

7 years ago

Flags: needinfo?(klibby)

Pete Moore [:pmoore][:pete]

Comment 2

•

7 years ago

Hey Chris, from Kendall's irc nick it looks like he might be on PTO - do you know anything about this?

Flags: needinfo?(catlee)

Chris AtLee [:catlee]

Comment 4

•

7 years ago

So I remember a bug from late last year (bug 1413585) about how geoip was wrong somehow, and impacted routing from SCL3.

Flags: needinfo?(catlee)

Chris Cooper [:coop] (he/him)

Comment 5

•

7 years ago

Re-directing the NI to Jake who's filling in for :fubar this week.

Flags: needinfo?(klibby) → needinfo?(jwatkins)

Jake Watkins [:dividehex]

Comment 6

•

7 years ago

A few observations here:

1. The host this task ran on (t-yosemite-r7-0214) is now located in MDC2 (it was on a long haul moving truck just last week going from the west coast to the east coast) and really shouldn't be running in production at the moment. The osx hosts that are there still need to be reimaged. Needless to say, we need to disable the worker on these hosts or quarantine them until MDC2 is 'production' ready.

2. The slowdown seems to be coming from pypi. From the log, you can see pip timing out and retrying downloads, which seems to have added enough time for the task to exceed its max runtime. pypi is hosted in SCL3 on the releng web cluster, so it might be worth having Webops take a look at it. If the cluster looks ok, it might have been network connectivity issues between SCL3 and MDC2.

3. I'm having trouble replicating the issue, which makes me wonder if it was a transient problem that has since cleared itself up. Are we still seeing this?

Flags: needinfo?(jwatkins)

Jake Watkins [:dividehex]

Updated

•

7 years ago

Depends on: 1445736

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 7

•

7 years ago

This was still an issue an hour ago: https://treeherder.mozilla.org/logviewer.html#?job_id=167992179&repo=mozilla-inbound&lineNumber=397-399

Eric Ziegenhorn :ericz

Comment 8

•

7 years ago

The releng web cluster servers that serve pypi are lightly loaded and the site seems responsive from mdc2 to me using curl. I checked the Apache logs and don't see many errors or anything about lost connections, etc. I'm running a curl loop from a server in mdc2 to look for connection errors.
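The curl loop described above could be approximated along the following lines. This is only a minimal sketch, not the loop that was actually run: the function names `fetch_once` and `probe`, the attempt count, the delay, and the timeout are all assumptions made for illustration.

```python
import time
import urllib.request
from collections import Counter


def fetch_once(url, timeout=10.0):
    """Do a single GET; return 'ok' or the name of the exception raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
        return "ok"
    except Exception as exc:  # URLError, HTTPError, timeouts, ...
        return type(exc).__name__


def probe(url, attempts=5, delay=1.0, fetch=fetch_once):
    """Call fetch(url) `attempts` times and tally the outcomes."""
    tally = Counter()
    for i in range(attempts):
        tally[fetch(url)] += 1
        if i < attempts - 1:
            time.sleep(delay)  # pause between attempts, not after the last
    return tally
```

Calling `probe("http://pypi.pub.build.mozilla.org/pub")` from an affected host would return a tally of successes and error types; the `fetch` parameter exists only so the loop can be exercised without network access.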

Andreea Pavel [:apavel]

Comment 9

•

7 years ago

In the 17 hours since this was filed, according to Neglected Oranges, this has had 180 occurrences, and I can confirm that it is still occurring: https://treeherder.mozilla.org/logviewer.html#?job_id=168107489&repo=mozilla-inbound&lineNumber=697-699

Linear Ni-Ya Li [:nli]

Reporter

Updated

•

7 years ago

Group: mozilla-employee-confidential

Pete Moore [:pmoore][:pete]

Comment 10

•

7 years ago

Could this be related to https://github.com/mozilla/DeepSpeech/issues/1289 ?

Flags: needinfo?(lissyx+mozillians)

Phil Ringnalda (:philor)

Updated

•

7 years ago

Blocks: 1445899

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

7 years ago

Whiteboard: [stockwell infra]

:gerard-majax

Comment 11

•

7 years ago

(In reply to Pete Moore [:pmoore][:pete] from comment #10)
> Could this be related to https://github.com/mozilla/DeepSpeech/issues/1289 ?

Proper answer: nothing to do with that.

Flags: needinfo?(lissyx+mozillians)

Jake Watkins [:dividehex]

Comment 12

•

7 years ago

I was able to manually reproduce this using pip on an MDC2 host but not on an MDC1 host. This problem seems unique to MDC2. I've also opened a bug (bug 1446176) with netops to help troubleshoot this issue.

(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# pip -V
pip 8.1.2 from /private/var/root/testing/lib/python2.7/site-packages (python 2.7)
(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# pip install --timeout 120 --no-index --find-links http://pypi.pvt.build.mozilla.org/pub --find-links http://pypi.pub.build.mozilla.org/pub --trusted-host pypi.pub.build.mozilla.org --trusted-host pypi.pvt.build.mozilla.org psutil>=3.1.1
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fca950>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcaad0>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcac50>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcadd0>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcaf50>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub

Eric Ziegenhorn :ericz

Comment 13

•

7 years ago

Can you reproduce it using curl to just fetch a package from pypi? Might be a slightly simpler test case if that's available.

Jake Watkins [:dividehex]

Comment 14

•

7 years ago

MDC2 gecko-t-osx-1010 workers need to stop taking jobs until MDC2 is 'production ready'. See: https://bugzilla.mozilla.org/show_bug.cgi?id=1445899#c2

Comment hidden (Intermittent Failures Robot)

Jake Watkins [:dividehex]

Comment 16

•

7 years ago

(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pub.build.mozilla.org/pub
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pvt.build.mozilla.org/pub
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:14 --:--:--     0
curl: (7) Failed to connect to pypi.pvt.build.mozilla.org port 80: Operation timed out

The connection to pub seems to be ok when using curl, but the connection to pvt times out. Maybe we need to add the MDC2 CIDR to a whitelist for pvt?

Comment hidden (Intermittent Failures Robot)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

7 years ago

Severity: normal → critical

Comment hidden (Intermittent Failures Robot)

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

7 years ago

Comment 22

•

7 years ago

(In reply to Jake Watkins [:dividehex] from comment #16)
> The connection to pub seems to be ok when using curl but the connection to
> pvt timesout. Maybe we need to add the MDC2 cidr to a whitelist for pvt?

We had updated the apache configs for everything on the relengweb cluster a while back when we noticed traffic from MDC1 was randomly failing (due to zeus cache misses). But it looks like we all ALSO missed a Zeus rule (releng-net-only) on the internal ZLB which hosts pypi.pvt.build.mozilla.org; in fact, that rule was missing the releng networks in MDC1, MDC2, us-west-2, and us-east-1. I've updated the releng-net-only rule to include those subnets (10.49.0.0/16, 10.51.0.0/16, 10.132.0.0/16, and 10.134.0.0/16).

There's also a note in IT puppet's modules/releng/manifests/init.pp:

# NOTE: if you change these, you'll also need to change the network_regexps
# secret in PuppetAgain. See
# https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Secrets and check
# with someone from relops/releng.

Jake, can you verify that's been updated?
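To illustrate what a source-network rule like releng-net-only amounts to, here is a rough sketch of a membership check against the four subnets quoted above. It does not reflect how the Zeus rule is actually written; the names `RELENG_SUBNETS` and `is_allowed` are hypothetical, and the sample addresses in the usage note are made up.

```python
import ipaddress

# The four subnets quoted in the comment above. The comment does not say
# which subnet belongs to which site, so no such mapping is assumed here.
RELENG_SUBNETS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.49.0.0/16", "10.51.0.0/16", "10.132.0.0/16", "10.134.0.0/16")
]


def is_allowed(addr, subnets=RELENG_SUBNETS):
    """Return True if addr falls inside any of the listed subnets."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in subnets)
```

For example, `is_allowed("10.51.3.7")` is true because the address lies within 10.51.0.0/16, while `is_allowed("10.50.0.1")` is false because none of the four subnets covers it.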

Flags: needinfo?(jwatkins)

Jake Watkins [:dividehex]

Comment 23

•

7 years ago

(In reply to Kendall Libby [:fubar] (PTO Mar 14-18) from comment #22)
> We had updated the apache configs for everything on the relengweb cluster a
> while back when we noticed traffic from MDC1 was randomly failing (due to
> zeus cache misses). But it looks like we all ALSO missed a Zeus rule
> (releng-net-only) on the internal ZLB which hosts
> pypi.pvt.build.mozilla.org; in fact, that rule was missing the releng
> networks in MDC1, MDC2, us-west-2, and us-east-1. I've updated the
> releng-net-only rule to include those subnets (10.49.0.0/16, 10.51.0.0/16,
> 10.132.0.0/16, and 10.134.0.0/16).
>
> Jake, can you verify that's been updated?

I noticed that rule on the internal ZLB as well, but it was set to disabled, and the fact that it didn't have most of our DC CIDR blocks in it led me to believe it was defunct anyway. I can confirm this is STILL being blocked from MDC2, so I think netops should take a look at this in bug 1446176.

[root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pub.build.mozilla.org/pub
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
[root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pvt.build.mozilla.org/pub
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:14 --:--:--     0
curl: (7) Failed to connect to pypi.pvt.build.mozilla.org port 80: Operation timed out

Flags: needinfo?(jwatkins)

Geoff Brown [:gbrown]

Updated

•

7 years ago

Blocks: 1411358

Geoff Brown [:gbrown]

Comment 24

•

7 years ago

There are similar issues affecting Linux (and possibly Android) tests, reported in bug 1411358 (alongside failures from other causes). For example:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=168901986&lineNumber=1019

[task 2018-03-19T10:52:31.184Z] 10:52:31 INFO - Installing None into virtualenv /builds/worker/workspace/build/venv
[task 2018-03-19T10:52:31.187Z] 10:52:31 INFO - error resolving pypi.pvt.build.mozilla.org (ignoring):

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.

Jake Watkins [:dividehex]

Comment 25

•

7 years ago

(In reply to Geoff Brown [:gbrown] from comment #24)
> There are similar issues affecting Linux (and possibly Android) tests,
> reported in bug 1411358 (alongside failures from other causes). For example:
>
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=168901986&lineNumber=1019
>
> [task 2018-03-19T10:52:31.184Z] 10:52:31 INFO - Installing None into
> virtualenv /builds/worker/workspace/build/venv
> [task 2018-03-19T10:52:31.187Z] 10:52:31 INFO - error resolving
> pypi.pvt.build.mozilla.org (ignoring):
>
> [taskcluster:error] Task timeout after 3600 seconds. Force killing container.

I'm not entirely sure this is related. This looks like it is from an AWS tc worker (as opposed to a hardware worker), and it is a failure to resolve the DNS name for pypi's external VIPs rather than a failure to connect. But it is still something that needs to be looked into.
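The distinction drawn above, a name that fails to resolve versus a connection that fails after resolving, can be checked separately. The following is a minimal sketch under stated assumptions: the function name `classify`, the default port and timeout, and the injectable `resolve`/`connect` parameters are all hypothetical and exist only for illustration.

```python
import socket


def classify(host, port=80, timeout=5.0,
             resolve=socket.getaddrinfo,
             connect=socket.create_connection):
    """Return 'dns-failure', 'connect-failure', or 'ok' for host:port."""
    # Step 1: name resolution. A failure here matches the "error resolving"
    # case seen in the Linux log.
    try:
        resolve(host, port)
    except socket.gaierror:
        return "dns-failure"
    # Step 2: TCP connect. A timeout or refusal here matches the
    # "Operation timed out" case seen from the MDC2 hosts.
    try:
        conn = connect((host, port), timeout)
    except OSError:
        return "connect-failure"
    conn.close()
    return "ok"
```

The two steps are kept in separate `try` blocks because `socket.gaierror` is itself a subclass of `OSError`; catching both in one block would blur the very distinction being tested.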

Jake Watkins [:dividehex]

Comment 27

•

7 years ago

Netops has fixed this issue in bug 1446176, and I have confirmed the traffic is no longer being blocked in MDC2.

[root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pvt.build.mozilla.org/pub
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   246  100   246    0     0    357      0 --:--:-- --:--:-- --:--:--   357

Comment hidden (Intermittent Failures Robot)

Jake Watkins [:dividehex]

Updated

•

7 years ago

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Comment hidden (Intermittent Failures Robot)