Closed
Bug 1443130
Opened 7 years ago
Closed 7 years ago
Intermittent [taskcluster:error] Task killed because maxRunTime was exceeded
Categories
(Testing :: Talos, defect, P5)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 1453007
People
(Reporter: aryx, Assigned: rwood)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell disabled])
Attachments
(1 obsolete file)
+++ This bug was initially created as a clone of Bug #1442736 +++
central-as-beta simulation hit this:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a7c4bd5b1fb1caefb97d75dec60bfa3e78a61c03&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable&selectedJob=165959208
03:24:35 INFO - TEST-INFO | started process 31919 (/home/cltbld/workspace/build/application/firefox/firefox -profile /tmp/tmplXSJYV/profile)
03:24:35 INFO - PID 31919 | MOZ_EVENT_TRACE start 1520249075864
03:24:35 INFO - PID 31919 | MOZ_EVENT_TRACE sample 1520249075931 35.656119
03:24:36 INFO - PID 31919 | MOZ_EVENT_TRACE sample 1520249076146 166.923099
03:24:36 INFO - PID 31919 | 1520249076146 addons.webextension.talos@mozilla.org WARN Please specify whether you want browser_style or not in your page_action options.
03:24:36 INFO - PID 31919 | 1520249076148 addons.webextension.talos@mozilla.org WARN Please specify whether you want browser_style or not in your browser_action options.
03:24:36 INFO - PID 31919 | MOZ_EVENT_TRACE sample 1520249076205 37.322715
03:24:36 INFO - PID 31919 | MOZ_EVENT_TRACE sample 1520249076226 20.881010
03:24:36 INFO - PID 31919 | MOZ_EVENT_TRACE sample 1520249076307 46.899740
03:24:36 INFO - PID 31919 |
03:24:36 INFO - PID 31919 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
03:24:36 INFO - PID 31919 |
03:24:36 INFO - PID 31919 | MOZ_EVENT_TRACE sample 1520249076957 35.354026
03:24:37 INFO - PID 31919 |
03:24:37 INFO - PID 31919 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
03:24:37 INFO - PID 31919 |
03:24:37 INFO - PID 31919 |
03:24:37 INFO - PID 31919 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
03:24:37 INFO - PID 31919 |
Comment hidden (Intermittent Failures Robot) |
Comment 2•7 years ago
there are many instances of this failure on linux over the weekend and today- the reason why- it takes 20 minutes to download target.tar.bz2:
https://taskcluster-artifacts.net/V7M8ep8JQQG1X2u8k3zOpA/0/public/logs/live_backing.log:
18:32:06 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/IqoodGkwR7ut6F-eCup1PQ/artifacts/public/build/target.talos.tests.zip'}, attempt #1
18:32:06 INFO - Fetch https://queue.taskcluster.net/v1/task/IqoodGkwR7ut6F-eCup1PQ/artifacts/public/build/target.talos.tests.zip into memory
18:32:25 INFO - Content-Length response header: 14031842
18:32:25 INFO - Bytes received: 14031842
18:32:26 INFO - Downloading https://queue.taskcluster.net/v1/task/IqoodGkwR7ut6F-eCup1PQ/artifacts/public/build/target.tar.bz2 to /home/cltbld/workspace/build/target.tar.bz2
18:32:26 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/IqoodGkwR7ut6F-eCup1PQ/artifacts/public/build/target.tar.bz2', 'file_name': '/home/cltbld/workspace/build/target.tar.bz2'}, attempt #1
18:55:13 INFO - Downloaded 62092291 bytes.
18:55:13 INFO - Setting buildbot property build_url to https://queue.taskcluster.net/v1/task/IqoodGkwR7ut6F-eCup1PQ/artifacts/public/build/target.tar.bz2
18:55:13 INFO - Writing buildbot properties ['build_url'] to /home/cltbld/workspace/properties/build_url
18:55:13 INFO - Writing to file /home/cltbld/workspace/properties/build_url
but in a normal passing scenario (https://taskcluster-artifacts.net/PR_Wu-3oRiiABz3oPovA-w/0/public/logs/live_backing.log):
09:42:59 INFO - Fetch https://queue.taskcluster.net/v1/task/IqoodGkwR7ut6F-eCup1PQ/artifacts/public/build/target.talos.tests.zip into memory
09:43:02 INFO - Content-Length response header: 14031842
09:43:02 INFO - Bytes received: 14031842
09:43:02 INFO - Downloading https://queue.taskcluster.net/v1/task/IqoodGkwR7ut6F-eCup1PQ/artifacts/public/build/target.tar.bz2 to /home/cltbld/workspace/build/target.tar.bz2
09:43:02 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/IqoodGkwR7ut6F-eCup1PQ/artifacts/public/build/target.tar.bz2', 'file_name': '/home/cltbld/workspace/build/target.tar.bz2'}, attempt #1
09:43:07 INFO - Downloaded 62092291 bytes.
09:43:07 INFO - Setting buildbot property build_url to https://queue.taskcluster.net/v1/task/IqoodGkwR7ut6F-eCup1PQ/artifacts/public/build/target.tar.bz2
09:43:07 INFO - Writing buildbot properties ['build_url'] to /home/cltbld/workspace/properties/build_url
I cannot think of any way we would need 20 minutes to download a build. Could something be wrong with our network or the machines? These are physical Linux machines in the datacenter.
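The stall described above can also be located mechanically. The following is a minimal sketch, assuming log lines start with an `HH:MM:SS` timestamp followed by a level and ` - ` as in the excerpts above; `find_gaps` and its threshold are hypothetical names chosen for illustration, and the sample lines are abbreviated from the failing log.

```python
import re
from datetime import datetime, timedelta

# Matches mozharness-style lines such as "18:32:26 INFO - Downloading ...".
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+\w+\s+-\s?(.*)$")


def find_gaps(lines, threshold_s=60):
    """Return (seconds, previous_message) for each pause longer than threshold_s."""
    gaps = []
    prev_time, prev_msg = None, None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        now = datetime.strptime(match.group(1), "%H:%M:%S")
        if prev_time is not None:
            delta = now - prev_time
            if delta < timedelta(0):
                delta += timedelta(days=1)  # assume the log crossed midnight once
            if delta.total_seconds() > threshold_s:
                gaps.append((int(delta.total_seconds()), prev_msg))
        prev_time, prev_msg = now, match.group(2)
    return gaps


# Abbreviated lines from the failing log quoted above.
sample = [
    "18:32:26 INFO - retry: Calling _download_file with args: (), attempt #1",
    "18:55:13 INFO - Downloaded 62092291 bytes.",
    "18:55:13 INFO - Writing to file /home/cltbld/workspace/properties/build_url",
]
gaps = find_gaps(sample)
```

On the sample, the only reported gap is the 1367 seconds between starting `_download_file` and the "Downloaded" line.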
Flags: needinfo?(gps)
Flags: needinfo?(dustin)
Flags: needinfo?(ddurst)
Comment 3•7 years ago
That's about 50KB/s :(
I don't know why that would happen.
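For reference, the arithmetic behind that estimate, using the byte count and timestamps from comment 2. This is only a sketch: the timestamps have one-second resolution, so the figure for the passing run is coarse, and `throughput_bytes_per_s` is a hypothetical helper name.

```python
from datetime import datetime


def throughput_bytes_per_s(start, end, nbytes):
    """Average throughput between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    elapsed = (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()
    return nbytes / elapsed


# Failing run: _download_file started 18:32:26, "Downloaded" logged 18:55:13.
slow = throughput_bytes_per_s("18:32:26", "18:55:13", 62092291)
# Passing run: started 09:43:02, finished 09:43:07.
fast = throughput_bytes_per_s("09:43:02", "09:43:07", 62092291)

print(f"slow: {slow / 1000:.1f} kB/s, fast: {fast / 1e6:.1f} MB/s")
```

The failing run works out to roughly 45 kB/s over 1367 seconds, against roughly 12 MB/s for the passing run.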
Flags: needinfo?(dustin)
Comment 4•7 years ago
Note that these are mozharness timestamps and mozharness timestamps need to be taken with a grain of salt. The reason is that mozharness adds the current wall time when it processes a log event or line of output from an invoked process. That's all fine. However, mozharness frequently runs processes with buffered output. So, output from an invoked process could get buffered for seconds or minutes before mozharness consumes it. Then mozharness will consume several lines at once and attribute them to the same time.
Whether that is happening here, I'm not sure. I /think/ the logged events are coming directly from mozharness ("[mozharness: 2018-03-12 01:31:48.721404Z] Running download-and-extract step."), which means there shouldn't be an event buffering problem.
Anyway, several minutes to download these files is a bit concerning. It is likely the downloading part that is slow. But you can't rule out local filesystem I/O being borked as well.
I'm not sure I can recommend any specific steps. Maybe we should start collecting better metrics about downloads so we know how prevalent problems like this are?
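One possible shape for such a metric is to time the transfer itself with a monotonic clock, so the measurement does not depend on when log lines happen to be flushed. This is a sketch only, not mozharness code; `timed_copy` is a hypothetical helper, and the in-memory streams stand in for a real HTTP response and output file.

```python
import io
import time


def timed_copy(read_chunk, write_chunk, clock=time.monotonic):
    """Copy chunks until read_chunk() returns b''; return (bytes, seconds)."""
    start = clock()
    total = 0
    while True:
        chunk = read_chunk()
        if not chunk:
            break
        write_chunk(chunk)
        total += len(chunk)
    return total, clock() - start


# Stand-ins for a network response and a destination file.
src = io.BytesIO(b"x" * 1000)
dst = io.BytesIO()
nbytes, seconds = timed_copy(lambda: src.read(256), dst.write)
print(f"copied {nbytes} bytes in {seconds:.6f}s")
```

Recording the byte count and elapsed time per download would make it possible to see how prevalent slow transfers are, independently of the log timestamps.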
Flags: needinfo?(gps)
Comment hidden (Intermittent Failures Robot) |
Comment 6•7 years ago
It seems that the failure rate is increasing https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1443130&endday=2018-03-13&startday=2018-03-10&tree=trunk
Flags: needinfo?(gps)
Reporter
Comment 7•7 years ago
Fubar, can you check if the downloads of those files are slow and, if so, whether other downloads at the same time are also slow?
from IRC:
jmaher: Aryx: pmoore: I worked with cosmin yesterday and we noticed a 20 minute lapse in the logs at the time of downloading the build
Aryx: for one os x debug build we had download times of 40 minutes, but else it seems to be linux talos
Flags: needinfo?(klibby)
Comment 8•7 years ago
It might also be the DC proxies. We're taking a look.
Flags: needinfo?(klibby)
Comment 9•7 years ago
[root@t-linux64-ms-183 ~]# wget https://queue.taskcluster.net/v1/task/IqoodGkwR7ut6F-eCup1PQ/artifacts/public/build/target.tar.bz2; rm target.tar.bz2
--2018-03-13 08:52:23-- https://queue.taskcluster.net/v1/task/IqoodGkwR7ut6F-eCup1PQ/artifacts/public/build/target.tar.bz2
Resolving queue.taskcluster.net (queue.taskcluster.net)... 50.17.218.87, 50.16.228.78, 107.22.197.53
Connecting to queue.taskcluster.net (queue.taskcluster.net)|50.17.218.87|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://taskcluster-artifacts.net/IqoodGkwR7ut6F-eCup1PQ/0/public/build/target.tar.bz2 [following]
--2018-03-13 08:52:23-- https://taskcluster-artifacts.net/IqoodGkwR7ut6F-eCup1PQ/0/public/build/target.tar.bz2
Resolving taskcluster-artifacts.net (taskcluster-artifacts.net)... 54.192.212.56
Connecting to taskcluster-artifacts.net (taskcluster-artifacts.net)|54.192.212.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 62092291 (59M) [application/x-bzip2]
Saving to: 'target.tar.bz2'
target.tar.bz2 100%[==============================>] 59.22M 18.1MB/s in 4.0s
Same host as in #c2, re-run several times. It looks like we're NOT using the DC proxies; I'm not sure if that's by design or accident, as I could have sworn that we were, but enabling them gets a '403 Forbidden' error from them.
The only change we've made recently (Mar 6) to the ubuntu16.04 config on the moonshots was to switch syslog back to using 514/tcp instead of udp.
Comment 10•7 years ago
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #2)
> there are many instances of this failure on linux over the weekend and
> today- the reason why- it takes 20 minutes to download target.tar.bz2:
I'm going to disagree with your assertion. There ARE cases where it looks to take 20 minutes, but there are also cases where we exceed max run time and this download either doesn't happen or happens very fast (based on the failures listed at https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1443130&endday=2018-03-13&startday=2018-03-10&tree=trunk)
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=167684633&lineNumber=345-347
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=167684702&lineNumber=344-346
20 seconds
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=167684622&lineNumber=40700
non-existent?
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=167682697&lineNumber=337
I don't even; talos test runs for 2 seconds before maxRunTime?!
I'm not saying we DON'T have a problem on linux/moonshots, but if we do I don't think it's clear what it is.
Comment 11•7 years ago
Actually, the first link you have [1] has a download of common.tests.zip that takes over 12 minutes:
07:56:16 INFO - Downloading and extracting to /home/cltbld/workspace/build/tests these dirs * from https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.common.tests.zip
07:56:16 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.common.tests.zip'}, attempt #1
07:56:16 INFO - Fetch https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.common.tests.zip into memory
08:08:54 INFO - Content-Length response header: 56969757
08:08:54 INFO - Bytes received: 56969757
and the 2nd link [2] shows about 11 minutes for the first of a couple of downloads:
07:57:49 INFO - u'xpcshell': [u'target.common.tests.zip', u'target.xpcshell.tests.zip']}
07:57:49 INFO - Downloading packages: [u'target.common.tests.zip', u'target.talos.tests.zip'] for test suite categories: ['common', 'talos']
07:57:49 INFO - Downloading and extracting to /home/cltbld/workspace/build/tests these dirs * from https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.common.tests.zip
07:57:49 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.common.tests.zip'}, attempt #1
07:57:49 INFO - Fetch https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.common.tests.zip into memory
08:08:53 INFO - Content-Length response header: 56969757
08:08:53 INFO - Bytes received: 56969757
08:08:58 INFO - Downloading and extracting to /home/cltbld/workspace/build/tests these dirs * from https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.talos.tests.zip
08:08:58 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.talos.tests.zip'}, attempt #1
08:08:58 INFO - Fetch https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.talos.tests.zip into memory
08:09:07 INFO - Content-Length response header: 14052878
08:09:07 INFO - Bytes received: 14052878
08:09:07 INFO - Downloading https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.tar.bz2 to /home/cltbld/workspace/build/target.tar.bz2
08:09:07 INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.tar.bz2', 'file_name': '/home/cltbld/workspace/build/target.tar.bz2'}, attempt #1
08:09:26 INFO - Downloaded 62046856 bytes.
#3 has a download that takes over 12 minutes [3]:
07:56:33 INFO - Downloading packages: [u'target.common.tests.zip', u'target.talos.tests.zip'] for test suite categories: ['common', 'talos']
07:56:33 INFO - Downloading and extracting to /home/cltbld/workspace/build/tests these dirs * from https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.common.tests.zip
07:56:33 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.common.tests.zip'}, attempt #1
07:56:33 INFO - Fetch https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.common.tests.zip into memory
08:08:54 INFO - Content-Length response header: 56969757
08:08:54 INFO - Bytes received: 56969757
08:08:58 INFO - Downloading and extracting to /home/cltbld/workspace/build/tests these dirs * from https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.talos.tests.zip
08:08:58 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.talos.tests.zip'}, attempt #1
08:08:58 INFO - Fetch https://queue.taskcluster.net/v1/task/XulAjx9oQHuF-0lQG5pJyA/artifacts/public/build/target.talos.tests.zip into memory
08:09:07 INFO - Content-Length response header: 140
The 4th link [4] has a maxRunTime of 15 minutes:
https://searchfox.org/mozilla-central/source/taskcluster/ci/test/talos.yml#275
Typically this takes 6 minutes to complete; I think 15 minutes leaves enough headroom for retrying a download or two.
This specific log might indicate there is a longer lag in the bootstrapping of Linux to get up and running. It is odd that there is <2 minutes of mozharness runtime before the 15 minutes expire.
In addition to the above, we do see maxRunTime hit in many cases where a test hangs and we have to kill it. While that might be hundreds of times/week across all OSes, the current rate of failures on Linux seems to be related to longer download times.
[1] https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=167684633&lineNumber=345-347
[2] https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=167684702&lineNumber=344-346
[3] https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=167684622&lineNumber=40700
[4] https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=167682697&lineNumber=337
Comment 13•7 years ago
It might be worth checking for firewall state table overflow. IIRC we had issues with that years ago, where a session would get evicted from the firewall and thus the firewall would drop all traffic on that TCP/IP quad without any RSTs or anything. The end of the connection waiting for data (the HTTP client, in this case) can then sit for a long time waiting for data that is never coming -- there's no TCP packet to say "hey, did you have more data for me or are you dead?"
That said, it looks like the download does eventually complete with the right number of bytes, so this guess doesn't fit all of the facts.
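For context on the "are you dead?" point: TCP keepalive is the optional mechanism that sends such probes on an idle connection, and it is normally off unless the application enables it. The sketch below shows how a client could turn it on for a socket; it is an illustration only, not what mozharness does, and `enable_keepalive` and its default values are hypothetical. The `TCP_KEEP*` options are platform-specific, so they are set only when the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive so a silently dropped connection is detected."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Platform-specific tuning knobs; skipped where the platform lacks them.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

With probes enabled, a connection whose state was evicted from a firewall would eventually error out instead of waiting indefinitely. A read timeout on the HTTP client would be a simpler alternative.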
Comment 14•7 years ago
There's not much context I can give on this bug. I suspect the problem is in the platform/network and I am far from an expert in these areas.
Flags: needinfo?(gps)
Comment 15•7 years ago
Joel, thanks for more info! I tried looking for long gaps in logging for those sorts of things, but clearly missed some. Setting a NI on :dragrom to take a look.
Flags: needinfo?(dcrisan)
Comment hidden (Intermittent Failures Robot) |
Comment 17•7 years ago
Tested several more times on the t-linux64-ms-059 and t-linux64-ms-183 servers:
On t-linux64-ms-183 server:
[root@t-linux64-ms-183 ~]# wget https://queue.taskcluster.net/v1/task/RCCR58sJSHSdiqz39nA12Q/artifacts/public/build/target.tar.bz2; rm target.tar.bz2
--2018-03-14 05:10:21-- https://queue.taskcluster.net/v1/task/RCCR58sJSHSdiqz39nA12Q/artifacts/public/build/target.tar.bz2
Resolving queue.taskcluster.net (queue.taskcluster.net)... 50.17.218.87, 107.22.197.53, 50.16.228.78
Connecting to queue.taskcluster.net (queue.taskcluster.net)|50.17.218.87|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://taskcluster-artifacts.net/RCCR58sJSHSdiqz39nA12Q/0/public/build/target.tar.bz2 [following]
--2018-03-14 05:10:21-- https://taskcluster-artifacts.net/RCCR58sJSHSdiqz39nA12Q/0/public/build/target.tar.bz2
Resolving taskcluster-artifacts.net (taskcluster-artifacts.net)... 54.192.212.56
Connecting to taskcluster-artifacts.net (taskcluster-artifacts.net)|54.192.212.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61989054 (59M) [application/x-bzip2]
Saving to: 'target.tar.bz2'
target.tar.bz2 100%[================================================================================================================>] 59.12M 9.50MB/s in 11s
2018-03-14 05:10:35 (5.36 MB/s) - 'target.tar.bz2' saved [61989054/61989054]
On next try:
[root@t-linux64-ms-183 ~]# wget https://queue.taskcluster.net/v1/task/RCCR58sJSHSdiqz39nA12Q/artifacts/public/build/target.tar.bz2; rm target.tar.bz2
--2018-03-14 05:12:47-- https://queue.taskcluster.net/v1/task/RCCR58sJSHSdiqz39nA12Q/artifacts/public/build/target.tar.bz2
Resolving queue.taskcluster.net (queue.taskcluster.net)... 50.16.228.78, 107.22.197.53, 50.17.218.87
Connecting to queue.taskcluster.net (queue.taskcluster.net)|50.16.228.78|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://taskcluster-artifacts.net/RCCR58sJSHSdiqz39nA12Q/0/public/build/target.tar.bz2 [following]
--2018-03-14 05:12:47-- https://taskcluster-artifacts.net/RCCR58sJSHSdiqz39nA12Q/0/public/build/target.tar.bz2
Resolving taskcluster-artifacts.net (taskcluster-artifacts.net)... 54.192.212.56
Connecting to taskcluster-artifacts.net (taskcluster-artifacts.net)|54.192.212.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61989054 (59M) [application/x-bzip2]
Saving to: 'target.tar.bz2'
target.tar.bz2 100%[================================================================================================================>] 59.12M 18.5MB/s in 3.9s
2018-03-14 05:12:52 (15.1 MB/s) - 'target.tar.bz2' saved [61989054/61989054]
On t-linux64-ms-059 server:
[root@t-linux64-ms-059 ~]# wget https://queue.taskcluster.net/v1/task/RCCR58sJSHSdiqz39nA12Q/artifacts/public/build/target.tar.bz2; rm target.tar.bz2
--2018-03-14 05:11:18-- https://queue.taskcluster.net/v1/task/RCCR58sJSHSdiqz39nA12Q/artifacts/public/build/target.tar.bz2
Resolving queue.taskcluster.net (queue.taskcluster.net)... 107.22.197.53, 50.16.228.78, 50.17.218.87
Connecting to queue.taskcluster.net (queue.taskcluster.net)|107.22.197.53|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://taskcluster-artifacts.net/RCCR58sJSHSdiqz39nA12Q/0/public/build/target.tar.bz2 [following]
--2018-03-14 05:11:18-- https://taskcluster-artifacts.net/RCCR58sJSHSdiqz39nA12Q/0/public/build/target.tar.bz2
Resolving taskcluster-artifacts.net (taskcluster-artifacts.net)... 54.192.212.56
Connecting to taskcluster-artifacts.net (taskcluster-artifacts.net)|54.192.212.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61989054 (59M) [application/x-bzip2]
Saving to: 'target.tar.bz2'
target.tar.bz2 100%[================================================================================================================>] 59.12M 18.4MB/s in 3.9s
2018-03-14 05:11:23 (15.1 MB/s) - 'target.tar.bz2' saved [61989054/61989054]
[root@t-linux64-ms-059 ~]# wget https://queue.taskcluster.net/v1/task/RCCR58sJSHSdiqz39nA12Q/artifacts/public/build/target.tar.bz2; rm target.tar.bz2
--2018-03-14 05:14:28-- https://queue.taskcluster.net/v1/task/RCCR58sJSHSdiqz39nA12Q/artifacts/public/build/target.tar.bz2
Resolving queue.taskcluster.net (queue.taskcluster.net)... 107.22.197.53, 50.16.228.78, 50.17.218.87
Connecting to queue.taskcluster.net (queue.taskcluster.net)|107.22.197.53|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://taskcluster-artifacts.net/RCCR58sJSHSdiqz39nA12Q/0/public/build/target.tar.bz2 [following]
--2018-03-14 05:14:28-- https://taskcluster-artifacts.net/RCCR58sJSHSdiqz39nA12Q/0/public/build/target.tar.bz2
Resolving taskcluster-artifacts.net (taskcluster-artifacts.net)... 54.192.212.56
Connecting to taskcluster-artifacts.net (taskcluster-artifacts.net)|54.192.212.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61989054 (59M) [application/x-bzip2]
Saving to: 'target.tar.bz2'
target.tar.bz2 100%[================================================================================================================>] 59.12M 18.0MB/s in 4.0s
2018-03-14 05:14:33 (14.8 MB/s) - 'target.tar.bz2' saved [61989054/61989054]
Flags: needinfo?(dcrisan)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 21•7 years ago
Hello! Since last night, this issue seems to have increased again, with 505 failures from the 16th to the 17th, the majority on Linux x64: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1443130&startday=2018-03-16&endday=2018-03-17&tree=all (the 2 on OS X, 1 on Linux 32 and 1 on Linux are misclassified)
:fubar can you take a look, or do you have an update here?
Thank you!
Flags: needinfo?(klibby)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 24•7 years ago
Someone commented elsewhere that some of these issues may be related to an intermittent issue with AWS and geoip, where traffic gets routed in very inconvenient ways. I just did a very quick test on t-linux64-ms-183, tracerouting to queue.taskcluster.net, and traffic went from MDC1 (Sacramento), to Miami, and back to San Jose:
[root@t-linux64-ms-183 ~]# traceroute -T queue.taskcluster.net
traceroute to queue.taskcluster.net (50.16.222.244), 30 hops max, 60 byte packets
1 ae1-256.fw1.test.releng.mdc1.mozilla.net (10.49.56.1) 0.502 ms 0.489 ms 0.473 ms
2 63.245.208.17 (63.245.208.17) 0.624 ms 0.713 ms 0.624 ms
3 65.74.145.154 (65.74.145.154) 0.988 ms 1.110 ms 1.290 ms
4 173.225.175.145 (173.225.175.145) 1.084 ms 2.459 ms 2.538 ms
5 ip65-46-225-53.z225-46-65.customer.algx.net (65.46.225.53) 1.333 ms 1.232 ms 1.324 ms
6 216.156.16.58.ptr.us.xo.net (216.156.16.58) 43.155 ms 43.204 ms 43.194 ms
7 te-4-1-0.rar3.miami-fl.us.xo.net (207.88.12.161) 69.219 ms 58.565 ms 58.509 ms
8 207.88.12.144.ptr.us.xo.net (207.88.12.144) 43.336 ms 43.272 ms 43.246 ms
9 207.88.12.190.ptr.us.xo.net (207.88.12.190) 46.365 ms 46.360 ms 50.060 ms
10 te0-12-0-0.rar3.sanjose-ca.us.xo.net (207.88.12.189) 44.742 ms 44.756 ms 44.728 ms
11 207.88.12.194.ptr.us.xo.net (207.88.12.194) 43.074 ms 43.606 ms 43.512 ms
12 207.88.14.199.ptr.us.xo.net (207.88.14.199) 43.529 ms 43.528 ms 42.993 ms
13 * * *
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 54.239.108.190 (54.239.108.190) 73.449 ms 54.239.108.54 (54.239.108.54) 71.447 ms 54.239.108.182 (54.239.108.182) 72.531 ms
21 54.239.110.190 (54.239.110.190) 70.751 ms * 54.239.110.140 (54.239.110.140) 69.767 ms
22 54.239.110.167 (54.239.110.167) 91.087 ms 54.239.110.247 (54.239.110.247) 86.938 ms 54.239.110.183 (54.239.110.183) 77.929 ms
23 54.239.111.95 (54.239.111.95) 74.550 ms 54.239.111.87 (54.239.111.87) 73.561 ms 54.239.111.89 (54.239.111.89) 73.345 ms
24 * * *
25 205.251.244.95 (205.251.244.95) 73.378 ms * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
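To see where the latency is added in output like the above, one can extract the first round-trip time per hop and look for the largest increase. This is a rough sketch under the assumption that lines follow the `hop host (ip) rtt ms ...` layout shown; `largest_rtt_jump` is a hypothetical helper, and hops that print only `* * *` are ignored.

```python
import re

# hop number, hostname, IP in parentheses, then the first RTT in ms.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)\s+([\d.]+)\s+ms")


def largest_rtt_jump(lines):
    """Return (hop, host, jump_ms) for the biggest RTT increase between responding hops."""
    hops = []
    for line in lines:
        match = HOP_RE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), float(match.group(4))))
    best = None
    for (_, _, prev_rtt), (hop, host, rtt) in zip(hops, hops[1:]):
        jump = rtt - prev_rtt
        if best is None or jump > best[2]:
            best = (hop, host, jump)
    return best


# A few hops copied from the traceroute above.
sample = [
    " 4 173.225.175.145 (173.225.175.145) 1.084 ms 2.459 ms 2.538 ms",
    " 5 ip65-46-225-53.z225-46-65.customer.algx.net (65.46.225.53) 1.333 ms 1.232 ms 1.324 ms",
    " 6 216.156.16.58.ptr.us.xo.net (216.156.16.58) 43.155 ms 43.204 ms 43.194 ms",
    " 7 te-4-1-0.rar3.miami-fl.us.xo.net (207.88.12.161) 69.219 ms 58.565 ms 58.509 ms",
    "13 * * *",
]
best = largest_rtt_jump(sample)
```

On these lines the largest jump is at hop 6, where the first RTT rises from about 1.3 ms to about 43 ms, just before the Miami hop.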
Flags: needinfo?(klibby)
Comment 25•7 years ago
I didn't realize the artifact downloads were coming from CloudFront. Erroneous GeoIP data could be at least one factor here.
<dividehex> dustin: where does taskcluster-artifacts.net reside?
<dividehex> is that a CDN?
<dustin> dividehex: yes
<dustin> cloudfront
<dividehex> ahh ok good!
<dividehex> that means GeoIP would have an effect
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 29•7 years ago
From the 2nd of April this started to increase again: 36 failures.
In the last 7 days we have had 41 failures. They occur on Linux x64, and the affected build types are opt and pgo.
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=171782517&lineNumber=13678
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (mozreview-request) |
Assignee
Updated•7 years ago
Assignee: nobody → rwood
Status: NEW → ASSIGNED
Assignee
Comment 35•7 years ago
Assignee
Comment 36•7 years ago
Assignee
Comment 37•7 years ago
Comment on attachment 8966674 [details]
Bug 1443130 - Allow more time for talos g2/g2 profiling to fix intermittent maxRunTime exceeded;
Nope, this doesn't solve the intermittent.
Attachment #8966674 -
Attachment is obsolete: true
Attachment #8966674 -
Flags: review?(jmaher)
Comment hidden (Intermittent Failures Robot) |
Comment 39•7 years ago
we disabled tps
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |