Bug 1499246 (Closed) — Opened last year; closed last year
ERROR - The following files failed: 'host-utils-61.0a1.en-US.linux-x86_64.tar.gz' | ERROR - The following files failed: 'linux64-minidump_stackwalk'
3.64 KB, text/plain
18.79 KB, application/zip
1.10 KB, patch
Beginning at Tue, Oct 9, 12:15:26 with https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=testfailed,busted,exception,runnable&tier=1,2,3&group_state=expanded&searchStr=android-hw&revision=4a92019d1ad6be91a38aed8033b5b85d1654dd78 the android-hw tests began to experience frequent intermittent failures to fetch files on the most frequently run try jobs. https://treeherder.mozilla.org/logviewer.html#?job_id=204314059&repo=try&lineNumber=795

The first mozilla-central job where this occurred was at Tue, Oct 9, 09:24:03 with https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&tier=1%2C2%2C3&group_state=expanded&searchStr=android-hw&selectedJob=204295629&revision=e96bcfe8669abdb7eaa9f034daba53d44d8c3e51

The first autoland job where this occurred was at Tue, Oct 9, 09:29:58 with https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&tier=1%2C2%2C3&group_state=expanded&searchStr=android-hw&selectedJob=204270932&revision=1c93105605f888686118d498efacc64e92080a63

The first mozilla-inbound job where this occurred was at Tue, Oct 9, 08:04:39 with https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&tier=1%2C2%2C3&group_state=expanded&searchStr=android-hw&selectedJob=204284415&revision=d49a5d674e007fc79beda9437889ca5a1eec4aa2
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
Bug 1501364 disabled android-hw on the trunk repos, including try, though beta was left enabled, and many try pushes continued to run due to developers' stale trees. While testing android-hw to see if we could re-enable it, I made two try pushes with --rebuild 20 for android-hw tests: <https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=bclary%40mozilla.com&group_state=expanded&fromchange=800c199f3f27deb30b1677f775fdd7995fd56b43&tochange=bcf2ac55096d9c48067938882dafa1fe96d2dd22> This download error was much less frequent there, though it did still occur.

https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#713 sets up the parameters for retrying the download. If we fail to get the file one or more times, the "ERROR - The following files failed:" message is emitted even though we may eventually get the file. If this were a mere warning, it would not cause the job to go orange; perhaps we can relax that. From the logs it does not appear that we are hitting one of the retry exceptions; instead we are getting a non-zero status code but not logging it. I'll try a patch to see what status code is actually being returned.

Another option to work around this for Bitbar is to bake the hostutils and minidump_stackwalk into the Bitbar container and not attempt to download them each time.
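The retry-and-log behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual mozharness code; the function and parameter names are hypothetical. The key point is logging the underlying exception or status on each failed attempt instead of retrying silently, and leaving the ERROR-vs-WARNING decision to the caller:

```python
import time

def fetch_with_retries(fetch, url, attempts=5, sleep_time=0, log=print):
    """Call fetch(url) up to `attempts` times, logging each failure with
    its underlying reason. Returns the fetched data, or None if every
    attempt fails; the caller decides whether that is a hard ERROR or
    merely a WARNING."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            # Surface the real reason, e.g.
            # "<urlopen error [Errno -3] Temporary failure in name resolution>"
            log("attempt %d/%d failed for %s: %s" % (attempt, attempts, url, exc))
            if attempt < attempts:
                time.sleep(sleep_time)
    return None
```

With a shape like this, a transient failure followed by a success would produce warning-level log lines but no orange job.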
I made a slight change to log the exception when tooltool.py's fetch fails and found:
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=bclary%40mozilla.com&group_state=expanded&fromchange=1f676fd17e982396b59472f9746a48e173a37575&selectedJob=207586835
https://treeherder.mozilla.org/logviewer.html#?job_id=207586835&repo=try&lineNumber=800

18:10:09 INFO - INFO - Attempting to fetch from 'https://tooltool.mozilla-releng.net/'...
18:10:19 INFO - INFO - ...failed to fetch 'linux64-minidump_stackwalk' from https://tooltool.mozilla-releng.net/
18:10:19 INFO - INFO - <urlopen error [Errno -3] Temporary failure in name resolution>
18:10:19 ERROR - ERROR - The following files failed: 'linux64-minidump_stackwalk'
18:10:19 ERROR - Return code: 1

Sakari is checking on /etc/resolv.conf in the Bitbar containers, but I wonder if there is some issue with tooltool.mozilla-releng.net / tooltool.mozilla-releng.net.herokudns.com (18.104.22.168):

$ getent hosts tooltool.mozilla-releng.net
22.214.171.124 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
126.96.36.199 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
188.8.131.52 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
184.108.40.206 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
220.127.116.11 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
18.104.22.168 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
22.214.171.124 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
126.96.36.199 tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net

:dividehex, :fubar: Any thoughts on whether this is our issue or something related to Bitbar's DNS setup?
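The same resolver check can be done from Python as a quick diagnostic (this is an illustrative helper, not part of the harness). A broken resolver raises socket.gaierror here, which urllib surfaces as the "<urlopen error [Errno -3] Temporary failure in name resolution>" seen in the log above:

```python
import socket

def resolve(hostname, port=443):
    """Return the sorted set of IPv4 addresses the resolver reports for
    `hostname`, or raise socket.gaierror if resolution fails (Errno -3
    is EAI_AGAIN, a temporary resolver failure)."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})
```

Running it in the Bitbar container at the moment of a failure would distinguish a container-local /etc/resolv.conf problem from something upstream.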
I think I found the point where the change caused the problems: Bug 1487798 comment 11
See Also: → 1487798
The images in use at Bitbar are 16.04 from 2018-07-03, which might include libraries too old to support our current TLS version requirements. We are considering creating new images based on 18.04. Any thoughts?
(In reply to Bob Clary [:bc:] from comment #10)
> I think I found the point where the change caused the problems: Bug 1487798
> comment 11

Interesting; changes to the SSL certificate shouldn't have any bearing on DNS resolution. Something in the back of my mind makes me wonder if that error is actually correct; ISTR seeing something similar ages ago with DXR, I think, where a python library wasn't being completely honest about what had happened... If that's the case, then your comment in #c11 could be the reason. If that's NOT it, then I'd be looking at potential network issues before DNS.
We've been looking into networking and DNS resolution issues in Bitbar's network with no indication of the problem being there. The images we use at Bitbar were created 2018-07-03, which is only slightly earlier than our in-tree Docker images, last updated 2018-07-26. There were also changes related to herokudns.com, in addition to the cert, weren't there?
The redirections are pretty weird.

HTTP/1.1 302 FOUND
Location: http://tooltool.mozilla-releng.net/docs
HTTP/1.1 302 FOUND
Location: https://tooltool.mozilla-releng.net/docs
HTTP/1.1 301 MOVED PERMANENTLY
Location: http://tooltool.mozilla-releng.net/docs/
HTTP/1.1 302 FOUND
Location: https://tooltool.mozilla-releng.net/docs/
HTTP/1.1 200 OK
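A chain like this can be captured programmatically by hooking urllib's redirect handling. The following sketch (illustrative names; not harness code) records each (status, Location) hop while the default handler follows the chain:

```python
import urllib.request

class RedirectLogger(urllib.request.HTTPRedirectHandler):
    """Record each (status, Location) hop while urllib follows redirects."""
    def __init__(self):
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # newurl has already been joined against the request URL by the
        # base handler, so it is absolute here.
        self.hops.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)

def trace(url):
    """Follow `url` through its redirect chain.
    Returns ([(code, location), ...], final_url)."""
    logger = RedirectLogger()
    opener = urllib.request.build_opener(logger)
    with opener.open(url) as resp:
        final_url = resp.geturl()
    return logger.hops, final_url
```

Run against the docs URL above, this would make the http→https→301→302 bouncing visible in one place.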
Upgrading the image to Ubuntu 18.04 proved to solve the problem. https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=52fc5af1d9328f61981fbbcf9ae3df7dd547fe08&selectedJob=207937869

The g5 errors are due to a device problem that will require an update and release of mozdevice. The p2 errors are due to mozsystemmonitor not being updated on pypi. I'll file bugs on those issues and we should be ready to go later today.
Ubuntu 18.04 was not a complete solution. I am seeing this when testing android hardware under load: <https://treeherder.mozilla.org/logviewer.html#?job_id=208148925&repo=try&lineNumber=1836-1853>. Not sure of the frequency yet, but it is still the URLError: <urlopen error [Errno -3] Temporary failure in name resolution> error (with the patch from Bug 1501802). If the frequency is too high I will probably just bake the host-utils and minidump_stackwalk into the image.
I added the tooltool.py from https://github.com/mozilla/build-tooltool, together with copies of testing/config/tooltool-manifests/linux64/hostutils.manifest and testing/config/tooltool-manifests/linux64/releng.manifest, to populate /builds/tooltool_cache during image creation. This prevents network access for the hostutils or minidump_stackwalk unless they become out of date.
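The cache-prepopulation step can be sketched as follows, assuming tooltool's digest-named cache layout, where a cached file is stored under the sha512 hex digest of its contents. This is an illustrative helper, not the actual image-build script, and the function name is hypothetical:

```python
import hashlib
import os
import shutil

def populate_tooltool_cache(file_paths, cache_dir):
    """Copy each file into a tooltool-style cache directory, keyed by the
    sha512 digest of its contents, so that a later fetch against a
    manifest listing the same digests is served from the cache instead of
    hitting the network."""
    os.makedirs(cache_dir, exist_ok=True)
    for path in file_paths:
        digest = hashlib.sha512()
        with open(path, "rb") as f:
            # Hash in 1 MiB chunks so large archives don't need to fit
            # in memory at once.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        shutil.copyfile(path, os.path.join(cache_dir, digest.hexdigest()))
```

As long as the manifests in the tree keep listing the same digests, the cached copies stay valid; when a manifest changes, tooltool falls back to a network fetch for the new digest.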
This specifies /builds/tooltool_cache for android-hw.
Attachment #9020611 - Flags: review?(jmaher)
These two changes have eliminated the temporary name resolution error, since we no longer hit the network for host-utils or linux64-minidump_stackwalk. <https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=bclary%40mozilla.com&tochange=19852c0d13e2d1d5acf232f08439e64dfac68cc5&fromchange=69bfd93defbfdb0af9b8ed341504c623247a149a>

The first run is with the regular population of devices we have been running with for some time. It initially had an issue where I had left out libcurl3, which is required by linux64-minidump_stackwalk from tooltool; I fixed that during the run. The second run is with the new motog5 and pixel2 devices.

No URLError: <urlopen error [Errno -3] Temporary failure in name resolution> errors. The majority of the remaining errors are due to jit failures, which are product issues and not test-infra related. The others are ADBTimeoutErrors, raptor-main TEST-UNEXPECTED-FAIL: no raptor test results were found on motog5, or other intermittent failures. Once this patch lands we can call this fixed.
Attachment #9020611 - Flags: review?(jmaher) → review+
We still see this on beta where attachment 9020611 [details] [diff] [review] is missing. Please check it into beta.
Whiteboard: [stockwell infra] → [stockwell infra][checkin-needed-beta]