Closed Bug 1445580 Opened 3 years ago Closed 3 years ago

issue on OSX machines - possible network connectivity issue

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nli, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [stockwell infra])

Attachments

(1 file)

This was raised via IRC.
There was a discussion on #taskcluster;
for scrollback, see https://mozilla.logbot.info/taskcluster/20180314


07:29:35 <pmoore> morning all
08:01:09 <dluca|sheriffduty> !t-rex: Hello, I'm seeing failure logs on OS X  finishing showing Unnamed step, a lots of them, on both inbound and autoland. Can it be machine related ?
08:09:42 <pmoore> dluca|sheriffduty: is the workerId always the same?
08:11:23 <dluca|sheriffduty> pmoore: Nope, IDs are different
08:12:19 <dluca|sheriffduty> pmoore: Is is the same machine type or test type : gecko-t-osx-1010
08:13:33 <pmoore> dluca|sheriffduty: this worker type is unfortunately on a slightly older version of the worker that makes it difficult for treeherder to parse the error messages, but looking at one of them i see
08:13:36 <pmoore> Aborting task - max run time exceeded!
08:13:51 <pmoore> i'll see if this is a common pattern....
08:14:45 <dluca|sheriffduty> I can see the same thing
08:15:10 <pmoore> this looks like some network slowness: https://treeherder.mozilla.org/logviewer.html#?job_id=167866941&repo=mozilla-inbound&lineNumber=413-427
08:15:41 <pmoore> i suspect problems with network connectivity - these are macs running in our data center ....
08:16:02 <pmoore> grenade: ^
08:16:30 <pmoore> dluca|sheriffduty: i'm not sure who can support these types of issues at this time
08:16:43 <pmoore> might be worth raising in #moc ?
08:17:15 <dluca|sheriffduty> pmoore: Ok, thank you for looking into it!
08:17:24 <pmoore> yw :)
08:19:42 <pmoore> dluca|sheriffduty: it might be worth retrying some of the failed jobs, in case the network slowness was intermittent
08:20:20 <dluca|sheriffduty> pmoore: Was going to ask about that



https://treeherder.mozilla.org/logviewer.html#?job_id=167866941&repo=mozilla-inbound&lineNumber=413-427

The log says the task was aborted because it hit the max run time.

Frankly, I'm far from understanding the issue.

Relops, 

Could you please take a look at this?

Thank you very much.
I've attached an MTR generated from nagios1.private.releng.mdc1 to tools.taskcluster.net

It looks like the latency increase is due to routing through Japan and Taiwan.  Is this intended?
Flags: needinfo?(klibby)
Hey Chris, from Kendall's IRC nick it looks like he might be on PTO. Do you know anything about this?
Flags: needinfo?(catlee)
I remember a bug from late last year (bug 1413585) about GeoIP data being wrong somehow, which impacted routing from SCL3.
Flags: needinfo?(catlee)
Re-directing the NI to Jake who's filling in for :fubar this week.
Flags: needinfo?(klibby) → needinfo?(jwatkins)
A few observations here:

1. The host this task ran on (t-yosemite-r7-0214) is now located in MDC2 (it was on a long-haul moving truck just last week going from the west coast to the east coast) and really shouldn't be running in production at the moment.  The osx hosts that are there still need to be reimaged.  Needless to say, we need to disable the worker on these hosts or quarantine them until MDC2 is 'production' ready.

2. The slowdown seems to be coming from pypi.  From the log, you can see pip timing out and retrying downloads, which seems to have added enough time for the task to exceed its max runtime.  pypi is hosted in SCL3 on the releng web cluster, so it might be worth having Webops take a look at it.  If the cluster looks ok, it might have been network connectivity issues between SCL3 and MDC2.

3. I'm having trouble replicating the issues which makes me wonder if it was a transient problem that has since cleared itself up.  Are we still seeing this?
Flags: needinfo?(jwatkins)
Depends on: 1445736
The releng web cluster servers that serve pypi are lightly loaded and the site seems responsive from mdc2 to me using curl.  I checked the Apache logs and don't see many errors or anything about lost connections, etc. I'm running a curl loop from a server in mdc2 to look for connection errors.
In the 17 hours since this was filed, according to Neglected oranges, this has had 180 occurrences, and I can confirm that it is still occurring: https://treeherder.mozilla.org/logviewer.html#?job_id=168107489&repo=mozilla-inbound&lineNumber=697-699
Group: mozilla-employee-confidential
Could this be related to https://github.com/mozilla/DeepSpeech/issues/1289 ?
Flags: needinfo?(lissyx+mozillians)
Blocks: 1445899
Whiteboard: [stockwell infra]
(In reply to Pete Moore [:pmoore][:pete] from comment #10)
> Could this be related to https://github.com/mozilla/DeepSpeech/issues/1289 ?

Proper answer: nothing to do with that.
Flags: needinfo?(lissyx+mozillians)
I was able to manually reproduce this using pip on an MDC2 host but not on an MDC1 host.  This problem seems unique to MDC2.  I've also opened a bug (bug 1446176) with netops to help troubleshoot this issue.

(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# pip -V
pip 8.1.2 from /private/var/root/testing/lib/python2.7/site-packages (python 2.7)


(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# pip install --timeout 120 --no-index --find-links http://pypi.pvt.build.mozilla.org/pub --find-links http://pypi.pub.build.mozilla.org/pub --trusted-host pypi.pub.build.mozilla.org --trusted-host pypi.pvt.build.mozilla.org psutil>=3.1.1
  Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fca950>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
  Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcaad0>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
  Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcac50>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
  Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcadd0>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
  Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcaf50>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
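
For triage, the retry warnings in a pip log like the one above can be counted mechanically. The following is a minimal Python 3 sketch and not something taken from this bug: the function name `summarize_retries` and the regular expression are assumptions based only on the pip 8.1.2 output format shown here, and other pip versions may word the warning differently.

```python
import re

# Assumed pattern for the urllib3 retry warning printed by pip 8.1.2 above.
# It captures the remaining retry budget, the errno, and the error text.
RETRY_RE = re.compile(
    r"Retrying \(Retry\(total=(?P<total>\d+),"
    r".*?\[Errno (?P<errno>\d+)\] (?P<reason>[^']+)'"
)


def summarize_retries(log_text):
    """Return (retry_count, causes) for the retry warnings in log_text.

    causes is a set of (errno, reason) tuples, so a log with a single
    failure mode yields a single-element set.
    """
    retries = 0
    causes = set()
    for line in log_text.splitlines():
        match = RETRY_RE.search(line)
        if match:
            retries += 1
            causes.add((int(match.group("errno")), match.group("reason").strip()))
    return retries, causes
```

Fed the session output above, this would report each `Retrying` line once and group them under the single `[Errno 60] Operation timed out` cause, which makes it easier to tell a connection timeout from other pip failures when scanning many task logs.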
Can you reproduce it using curl to just fetch a package from pypi?  Might be a slightly simpler test case if that's available.
MDC2 gecko-t-osx-1010 workers need to stop taking jobs until MDC2 is 'production ready'

See: https://bugzilla.mozilla.org/show_bug.cgi?id=1445899#c2
(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pub.build.mozilla.org/pub
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pvt.build.mozilla.org/pub
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:14 --:--:--     0
curl: (7) Failed to connect to pypi.pvt.build.mozilla.org port 80: Operation timed out

The connection to pub seems to be ok when using curl but the connection to pvt times out.  Maybe we need to add the MDC2 CIDR to a whitelist for pvt?
Severity: normal → critical
Duplicate of this bug: 1440995
Duplicate of this bug: 1446608
(In reply to Jake Watkins [:dividehex] from comment #16)
>
> The connection to pub seems to be ok when using curl but the connection to
> pvt times out.  Maybe we need to add the MDC2 CIDR to a whitelist for pvt?

We had updated the apache configs for everything on the relengweb cluster a while back when we noticed traffic from MDC1 was randomly failing (due to zeus cache misses). But it looks like we all ALSO missed a Zeus rule (releng-net-only) on the internal ZLB which hosts pypi.pvt.build.mozilla.org; in fact, that rule was missing the releng networks in MDC1, MDC2, us-west-2, and us-east-1. I've updated the releng-net-only rule to include those subnets (10.49.0.0/16, 10.51.0.0/16, 10.132.0.0/16, and 10.134.0.0/16).
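
To illustrate the kind of source-address check that a rule like releng-net-only performs, here is a minimal Python 3 sketch. It is not the Zeus rule itself and makes no claim about how the ZLB evaluates it; the function name `is_allowed` and the sample addresses in the usage note are hypothetical, and only the four subnets come from the comment above.

```python
import ipaddress

# Subnets listed above as added to the releng-net-only rule.
RELENG_SUBNETS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.49.0.0/16", "10.51.0.0/16", "10.132.0.0/16", "10.134.0.0/16")
]


def is_allowed(client_ip, subnets=RELENG_SUBNETS):
    """Return True if client_ip falls inside any of the given subnets."""
    address = ipaddress.ip_address(client_ip)
    return any(address in subnet for subnet in subnets)
```

With this list, a hypothetical client at 10.51.3.4 would be accepted, while one at 10.50.3.4 would not. A check like this makes it easy to confirm whether a given worker address is covered before and after a rule change.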

There's also a note in IT puppet's modules/releng/manifests/init.pp:
    # NOTE: if you change these, you'll also need to change the network_regexps
    # secret in PuppetAgain.  See
    # https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Secrets and check
    # with someone from relops/releng.

Jake, can you verify that's been updated?
Flags: needinfo?(jwatkins)
(In reply to Kendall Libby [:fubar] (PTO Mar 14-18) from comment #22)


> We had updated the apache configs for everything on the relengweb cluster a
> while back when we noticed traffic from MDC1 was randomly failing (due to
> zeus cache misses). But it looks like we all ALSO missed a Zeus rule
> (releng-net-only) on the internal ZLB which hosts
> pypi.pvt.build.mozilla.org; in fact, that rule was missing the releng
> networks in MDC1, MDC2, us-west-2, and us-east-1. I've updated the
> releng-net-only rule to include those subnets (10.49.0.0/16, 10.51.0.0/16,
> 10.132.0.0/16, and 10.134.0.0/16).


> Jake, can you verify that's been updated?

I also noticed that rule on the internal ZLB, but it was set to disabled, and the fact that it didn't have most of our DC CIDR blocks in it led me to believe it was defunct anyway.  I can confirm this is STILL being blocked from MDC2, so I think netops should take a look at this in bug 1446176.

[root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pub.build.mozilla.org/pub
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
[root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pvt.build.mozilla.org/pub
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:01:14 --:--:--     0
curl: (7) Failed to connect to pypi.pvt.build.mozilla.org port 80: Operation timed out
Flags: needinfo?(jwatkins)
Blocks: 1411358
There are similar issues affecting Linux (and possibly Android) tests, reported in bug 1411358 (alongside failures from other causes). For example:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=168901986&lineNumber=1019

[task 2018-03-19T10:52:31.184Z] 10:52:31     INFO - Installing None into virtualenv /builds/worker/workspace/build/venv
[task 2018-03-19T10:52:31.187Z] 10:52:31     INFO - error resolving pypi.pvt.build.mozilla.org (ignoring): 

[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
(In reply to Geoff Brown [:gbrown] from comment #24)
> There are similar issues affecting Linux (and possibly Android) tests,
> reported in bug 1411358 (alongside failures from other causes). For example:
> 
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=168901986&lineNumber=1019
> 
> [task 2018-03-19T10:52:31.184Z] 10:52:31     INFO - Installing None into
> virtualenv /builds/worker/workspace/build/venv
> [task 2018-03-19T10:52:31.187Z] 10:52:31     INFO - error resolving
> pypi.pvt.build.mozilla.org (ignoring): 
> 
> [taskcluster:error] Task timeout after 3600 seconds. Force killing container.

I'm not entirely sure this is related.  This looks like it is from an AWS TC worker (as opposed to a hardware worker), and it is a failure to resolve the DNS name for pypi's external VIPs rather than a failure to connect.  But it is still something that needs to be looked into.
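
The difference between a name that does not resolve and a host that resolves but cannot be reached can be checked separately. The following is a minimal Python 3 sketch using only the standard `socket` module; the function name `classify_reachability` and its return strings are hypothetical, and it is an illustration of the distinction rather than a tool used in this bug.

```python
import socket


def classify_reachability(host, port=80, timeout=10.0):
    """Return 'dns_failed', 'connect_failed', or 'ok' for host:port."""
    # Step 1: name resolution. A failure here corresponds to the
    # "error resolving" symptom, before any connection is attempted.
    try:
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failed"

    # Step 2: TCP connect to each resolved address. Timeouts and refusals
    # are both OSError subclasses, so either counts as a connect failure.
    for family, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            continue
    return "connect_failed"
```

Run from an affected worker against the pypi host names, a 'dns_failed' result would point at resolution, as in the Linux log quoted above, while 'connect_failed' would match the timeout seen from MDC2.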
Duplicate of this bug: 1446784
Netops has fixed this issue in bug 1446176, and I have confirmed the traffic is no longer being blocked in MDC2.


[root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pvt.build.mozilla.org/pub
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   246  100   246    0     0    357      0 --:--:-- --:--:-- --:--:--   357
Duplicate of this bug: 1443394
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED