Closed
Bug 1499246
Opened 7 years ago
Closed 7 years ago
ERROR - The following files failed: 'host-utils-61.0a1.en-US.linux-x86_64.tar.gz' | ERROR - The following files failed: 'linux64-minidump_stackwalk'
Categories
(Testing :: General, defect)
Tracking
(firefox64 fixed, firefox65 fixed)
RESOLVED
FIXED
mozilla65
People
(Reporter: bc, Assigned: bc)
References
Details
(Whiteboard: [stockwell infra])
Attachments
(3 files)
Beginning at Tue, Oct 9, 12:15:26 with https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=testfailed,busted,exception,runnable&tier=1,2,3&group_state=expanded&searchStr=android-hw&revision=4a92019d1ad6be91a38aed8033b5b85d1654dd78, the android-hw tests began to experience frequent intermittent failures to fetch files on try, where these jobs are run most frequently.
https://treeherder.mozilla.org/logviewer.html#?job_id=204314059&repo=try&lineNumber=795
The first mozilla-central job where this occurred was at Tue, Oct 9, 09:24:03 with https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&tier=1%2C2%2C3&group_state=expanded&searchStr=android-hw&selectedJob=204295629&revision=e96bcfe8669abdb7eaa9f034daba53d44d8c3e51
The first autoland job where this occurred was at Tue, Oct 9, 09:29:58 with https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&tier=1%2C2%2C3&group_state=expanded&searchStr=android-hw&selectedJob=204270932&revision=1c93105605f888686118d498efacc64e92080a63
The first mozilla-inbound job where this occurred was at Tue, Oct 9, 08:04:39 with https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&tier=1%2C2%2C3&group_state=expanded&searchStr=android-hw&selectedJob=204284415&revision=d49a5d674e007fc79beda9437889ca5a1eec4aa2
Updated•7 years ago
|
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
Assignee | ||
Comment 8•7 years ago
|
||
Bug 1501364 disabled android-hw on trunk repos, including try, though beta was left enabled and many try pushes continued to run android-hw because developers were pushing from stale trees.
While testing android-hw to see if we could re-enable it, I made two try pushes with --rebuild 20 for the android-hw tests:
<https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=bclary%40mozilla.com&group_state=expanded&fromchange=800c199f3f27deb30b1677f775fdd7995fd56b43&tochange=bcf2ac55096d9c48067938882dafa1fe96d2dd22>
This download error was much less frequent though it did still occur.
https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#713 sets up the parameters for retrying the download. If we fail to get the file one or more times, the "ERROR - The following files failed:" message is emitted even though a later retry may actually get the file. If this were a mere warning, it would not cause the job to go orange. Perhaps we can relax that. From the logs, it doesn't appear that we are hitting one of the retry exceptions; instead we seem to be getting a non-zero status code that we do not log. I'll try a patch to see what status code is actually being returned.
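To make the "warning instead of error" idea concrete, here is a minimal sketch of a retry wrapper. It is not the mozharness code: fetch_with_retries, its parameters and the logger name are hypothetical. The point is only that intermediate failures are logged as warnings together with the exception, and an error is emitted only once every attempt has failed.

```python
import logging
import time

log = logging.getLogger("download")  # hypothetical logger name


def fetch_with_retries(fetch, attempts=3, sleep_seconds=0.0):
    """Call fetch() up to `attempts` times and return its result.

    Intermediate failures are logged as warnings, including the exception so
    the underlying cause is visible. An error is logged only after the last
    attempt has failed, and the last exception is re-raised.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # sketch only: real code would catch narrower types
            last_error = exc
            log.warning("attempt %d/%d failed: %r", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(sleep_seconds)
    log.error("giving up after %d attempts: %r", attempts, last_error)
    raise last_error
```

With this shape, a download that succeeds on the second or third try leaves only warnings in the log, so a log parser keyed on "ERROR" would not turn the job orange.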
Another option to work around this for bitbar is to just bake the hostutils and minidump_stackwalk into the bitbar container and not attempt to download them each time.
Assignee | ||
Comment 9•7 years ago
|
||
I made a slight change to log the exception when tooltool.py's fetch fails and found:
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=bclary%40mozilla.com&group_state=expanded&fromchange=1f676fd17e982396b59472f9746a48e173a37575&selectedJob=207586835
https://treeherder.mozilla.org/logviewer.html#?job_id=207586835&repo=try&lineNumber=800
18:10:09 INFO - INFO - Attempting to fetch from 'https://tooltool.mozilla-releng.net/'...
18:10:19 INFO - INFO - ...failed to fetch 'linux64-minidump_stackwalk' from https://tooltool.mozilla-releng.net/
18:10:19 INFO - INFO - <urlopen error [Errno -3] Temporary failure in name resolution>
18:10:19 ERROR - ERROR - The following files failed: 'linux64-minidump_stackwalk'
18:10:19 ERROR - Return code: 1
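For reference, on Linux with glibc "[Errno -3] Temporary failure in name resolution" is getaddrinfo's EAI_AGAIN, which urllib wraps in a URLError. The helper below is a sketch of how a caller could tell this transient DNS failure apart from other fetch errors; is_transient_dns_failure is a hypothetical name, not something in tooltool.py.

```python
import socket
from urllib.error import URLError


def is_transient_dns_failure(exc):
    """Return True if exc is a URLError wrapping getaddrinfo's EAI_AGAIN."""
    reason = getattr(exc, "reason", None)
    return (
        isinstance(exc, URLError)
        and isinstance(reason, socket.gaierror)
        and reason.errno == socket.EAI_AGAIN
    )
```

A check like this could be used to decide whether a failed fetch is worth retrying after a short pause rather than being reported immediately.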
Sakari is checking on /etc/resolv.conf in the Bitbar containers, but I wonder if there is some issue with tooltool.mozilla-releng.net / tooltool.mozilla-releng.net.herokudns.com (54.174.228.92)
$ getent hosts tooltool.mozilla-releng.net
54.209.64.71 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
52.4.75.11 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.152.208.69 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.164.206.44 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.165.51.142 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.172.170.160 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.173.32.212 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.174.228.92 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
:dividehex, :fubar: Any thoughts on whether this is our issue or something related to Bitbar's DNS setup?
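One rough way to measure how often resolution fails from inside a container is to resolve the host repeatedly and count the failures. This is only a sketch: probe_resolution is a hypothetical helper, the port and try count are arbitrary, and a local caching resolver could hide failures that a fresh lookup would hit.

```python
import socket


def probe_resolution(host, tries=20, resolve=socket.getaddrinfo):
    """Resolve `host` repeatedly; return (failure_count, set_of_addresses).

    `resolve` is injectable so the function can be exercised without a network.
    """
    failures = 0
    addresses = set()
    for _ in range(tries):
        try:
            for info in resolve(host, 443):
                # Each result is (family, type, proto, canonname, sockaddr);
                # sockaddr[0] is the address string.
                addresses.add(info[4][0])
        except socket.gaierror:
            failures += 1
    return failures, addresses
```

For example, `probe_resolution("tooltool.mozilla-releng.net")` run inside one of the containers would show whether lookups fail intermittently and which addresses come back.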
Flags: needinfo?(klibby)
Flags: needinfo?(jwatkins)
Assignee | ||
Comment 10•7 years ago
|
||
I think I found the change that caused the problems: Bug 1487798 comment 11
See Also: → 1487798
Assignee | ||
Comment 11•7 years ago
|
||
The images in use at Bitbar are Ubuntu 16.04 from 2018-07-03, which might have obsolete libraries that do not support our current TLS version requirements. We are considering creating new images based on 18.04. Any thoughts?
Comment 12•7 years ago
|
||
(In reply to Bob Clary [:bc:] from comment #10)
> I think I found the point where the change caused the problems: Bug 1487798
> comment 11
Interesting; changes to the SSL certificate shouldn't have any bearing on DNS resolution. Something in the back of my mind makes me wonder if that error is actually correct; ISTR seeing something similar ages ago with DXR, I think, where a python library wasn't being completely honest about what had happened... If that's the case, then your comment in #c11 could be the reason.
If that's NOT it, then I'd be looking at potential network issues before DNS.
Flags: needinfo?(klibby)
Assignee | ||
Comment 13•7 years ago
|
||
We've been looking into networking and DNS resolution issues in Bitbar's network and have found no indication that the problem is there.
The images we use at Bitbar were created 2018-07-03, which is only slightly earlier than our in-tree Docker images, which were last updated 2018-07-26.
There were also changes related to herokudns.com in addition to the cert, weren't there?
Assignee | ||
Comment 14•7 years ago
|
||
The redirections are pretty weird.
HTTP/1.1 302 FOUND
Location: http://tooltool.mozilla-releng.net/docs
HTTP/1.1 302 FOUND
Location: https://tooltool.mozilla-releng.net/docs
HTTP/1.1 301 MOVED PERMANENTLY
Location: http://tooltool.mozilla-releng.net/docs/
HTTP/1.1 302 FOUND
Location: https://tooltool.mozilla-releng.net/docs/
HTTP/1.1 200 OK
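A chain like this can be reproduced by following the redirects by hand and recording each hop. The sketch below uses only the standard library; head and trace_redirects are hypothetical helper names, and the starting URL is left to the caller since it is not shown above.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def head(url):
    """Issue one HEAD request without following redirects."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, fetch=head, max_hops=10):
    """Return a list of (status, url) pairs, one per request made."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((status, url))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops
```

Each hop is a separate request, and every switch between http and https needs a new connection, so a chain with four redirects gives a flaky network several more chances to fail before the final 200.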
Assignee | ||
Comment 15•7 years ago
|
||
Upgrading the image to Ubuntu 18.04 proved to solve the problem.
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=52fc5af1d9328f61981fbbcf9ae3df7dd547fe08&selectedJob=207937869
The g5 errors are due to a problem with them that will require an update and release of mozdevice.
The p2 errors are due to mozsystemmonitor not being updated on pypi.
I'll file bugs on those issues and we should be ready to go later today.
Flags: needinfo?(jwatkins)
Assignee | ||
Comment 16•7 years ago
|
||
Ubuntu 18.04 was not a complete solution. I am seeing this when testing android hardware under load:
<https://treeherder.mozilla.org/logviewer.html#?job_id=208148925&repo=try&lineNumber=1836-1853>. I'm not sure of the frequency yet, but it is still the "URLError: <urlopen error [Errno -3] Temporary failure in name resolution>" error (with the patch from Bug 1501802). If the frequency is too high, I will probably just bake the host-utils and minidump_stackwalk into the image.
Assignee | ||
Comment 17•7 years ago
|
||
I added the tooltool.py from https://github.com/mozilla/build-tooltool, along with copies of testing/config/tooltool-manifests/linux64/hostutils.manifest and testing/config/tooltool-manifests/linux64/releng.manifest, to populate /builds/tooltool_cache in the image during creation. This avoids network access for the hostutils or minidump_stackwalk unless the cached copies become out of date.
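A quick sanity check for a pre-populated cache is to compare it against the manifests. The sketch below assumes that a tooltool manifest is a JSON list of entries with "filename", "digest" and "algorithm" keys, and that the cache stores each file under its digest as the file name. Both points are assumptions to confirm against tooltool.py, and missing_from_cache is a hypothetical helper.

```python
import hashlib
import json
import os


def missing_from_cache(manifest_path, cache_dir):
    """Return filenames from the manifest that are absent or corrupt in cache_dir.

    Assumes cache entries are stored under their digest as the file name.
    """
    with open(manifest_path) as fh:
        entries = json.load(fh)
    missing = []
    for entry in entries:
        cached = os.path.join(cache_dir, entry["digest"])
        if not os.path.isfile(cached):
            missing.append(entry["filename"])
            continue
        hasher = hashlib.new(entry.get("algorithm", "sha512"))
        with open(cached, "rb") as fh:
            # Hash in 1 MiB chunks to avoid loading large archives into memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                hasher.update(chunk)
        if hasher.hexdigest() != entry["digest"]:
            missing.append(entry["filename"])
    return missing
```

An empty result for both hostutils.manifest and releng.manifest against /builds/tooltool_cache would suggest that a job can be served entirely from the image; any name returned would still need the network.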
Assignee | ||
Comment 18•7 years ago
|
||
This specifies /builds/tooltool_cache for android-hw.
Attachment #9020611 -
Flags: review?(jmaher)
Assignee | ||
Comment 19•7 years ago
|
||
These two changes have eliminated the temporary name resolution error since we no longer hit the network for host-utils or linux64-minidump_stackwalk.
<https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=bclary%40mozilla.com&tochange=19852c0d13e2d1d5acf232f08439e64dfac68cc5&fromchange=69bfd93defbfdb0af9b8ed341504c623247a149a>
The first run is with the regular population of devices we have been running with for some time. It initially had an issue because I had left out libcurl3, which is required by linux64-minidump_stackwalk from tooltool. I fixed that during the run.
The second run is with the new motog5 and pixel2 devices.
No URLError: <urlopen error [Errno -3] Temporary failure in name resolution> errors.
The majority of the remaining errors are due to jit failures, which are a product issue and not related to test infra.
The others are ADBTimeoutErrors, raptor-main TEST-UNEXPECTED-FAIL: no raptor test results were found on motog5 or other intermittent failures.
Once this patch lands we can call this fixed.
Updated•7 years ago
|
Attachment #9020611 -
Flags: review?(jmaher) → review+
Assignee | ||
Comment 21•7 years ago
|
||
Comment 22•7 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox65:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Assignee | ||
Comment 23•7 years ago
|
||
We still see this on beta where attachment 9020611 [details] [diff] [review] is missing. Please check it into beta.
Whiteboard: [stockwell infra] → [stockwell infra][checkin-needed-beta]
Comment 24•7 years ago
|
||
bugherder uplift |
status-firefox64:
--- → fixed
Whiteboard: [stockwell infra][checkin-needed-beta] → [stockwell infra]