Closed Bug 1499246 Opened 6 years ago Closed 6 years ago

ERROR - The following files failed: 'host-utils-61.0a1.en-US.linux-x86_64.tar.gz' | ERROR - The following files failed: 'linux64-minidump_stackwalk'

Categories

(Testing :: General, defect)

defect
Not set
normal

Tracking

(firefox64 fixed, firefox65 fixed)

RESOLVED FIXED
mozilla65

People

(Reporter: bc, Assigned: bc)

References

Details

(Whiteboard: [stockwell infra])

Attachments

(3 files)

Whiteboard: [stockwell disable-recommended] → [stockwell infra]
Blocks: 1501364
Bug 1501364 disabled android-hw on trunk repos, including try, though beta was left enabled and many try pushes continued to run android-hw because developers pushed from stale trees.

While testing android-hw to see if we could re-enable it, I made two try pushes with --rebuild 20 for the android-hw tests:

<https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=bclary%40mozilla.com&group_state=expanded&fromchange=800c199f3f27deb30b1677f775fdd7995fd56b43&tochange=bcf2ac55096d9c48067938882dafa1fe96d2dd22>

This download error was much less frequent, though it still occurred.

https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#713 sets up the parameters for retrying the download. If we fail to get the file one or more times, the "ERROR - The following files failed:" message is emitted even though a later attempt may actually get the file. If this were a mere warning, it would not cause the job to go orange, so perhaps we can relax that. From the logs, it doesn't appear that we are hitting one of the retry exceptions; instead we get a non-zero status code that we do not log. I'll try a patch to see what status code is actually being returned.
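The idea of treating intermediate failures as warnings can be sketched roughly as follows. This is only an illustration, not the mozharness or tooltool.py code: `fetch_with_retries` and its parameters are hypothetical names, and `fetch` stands in for whatever callable performs the download.

```python
import logging
import time

log = logging.getLogger("download")


def fetch_with_retries(fetch, attempts=3, sleep_seconds=0.0):
    """Call fetch() up to `attempts` times (hypothetical helper).

    Intermediate failures are logged as warnings; an error is logged only
    once every attempt has failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            last_error = exc
            # Log the exception itself so the real cause shows up in the log.
            log.warning("attempt %d/%d failed: %r", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(sleep_seconds)
    log.error("all %d attempts failed: %r", attempts, last_error)
    raise last_error
```

With this shape, a download that succeeds on a later attempt leaves only warnings in the log, so a log parser that keys on ERROR lines would not turn the job orange.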

Another option to work around this for Bitbar is to bake host-utils and minidump_stackwalk into the Bitbar container and not attempt to download them each time.
I made a slight change to log the exception when tooltool.py's fetch fails and found:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=bclary%40mozilla.com&group_state=expanded&fromchange=1f676fd17e982396b59472f9746a48e173a37575&selectedJob=207586835
https://treeherder.mozilla.org/logviewer.html#?job_id=207586835&repo=try&lineNumber=800

18:10:09     INFO -  INFO - Attempting to fetch from 'https://tooltool.mozilla-releng.net/'...
18:10:19     INFO -  INFO - ...failed to fetch 'linux64-minidump_stackwalk' from https://tooltool.mozilla-releng.net/
18:10:19     INFO -  INFO - <urlopen error [Errno -3] Temporary failure in name resolution>
18:10:19    ERROR -  ERROR - The following files failed: 'linux64-minidump_stackwalk'
18:10:19    ERROR - Return code: 1

Sakari is checking on /etc/resolv.conf in the Bitbar containers, but I wonder whether there is some issue with tooltool.mozilla-releng.net / tooltool.mozilla-releng.net.herokudns.com (54.174.228.92).

$ getent hosts tooltool.mozilla-releng.net
54.209.64.71    tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
52.4.75.11      tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.152.208.69   tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.164.206.44   tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.165.51.142   tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.172.170.160  tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.173.32.212   tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
54.174.228.92   tooltool.mozilla-releng.net.herokudns.com tooltool.mozilla-releng.net
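
Because the failure is intermittent, a single successful lookup like the one above says little. A rough sketch of a repeated probe, using only the standard library, might look like this. The function name `probe_resolution` and the injectable `resolve` parameter are illustrative assumptions, not part of any existing tool.

```python
import socket


def probe_resolution(hostname, tries=20, resolve=socket.getaddrinfo):
    """Resolve `hostname` `tries` times; return (failures, addresses seen)."""
    failures = 0
    addresses = set()
    for _ in range(tries):
        try:
            for info in resolve(hostname, 443):
                # Each result is (family, type, proto, canonname, sockaddr);
                # the first element of sockaddr is the IP address.
                addresses.add(info[4][0])
        except socket.gaierror:
            # On Linux, Errno -3 (EAI_AGAIN) is reported as
            # "Temporary failure in name resolution".
            failures += 1
    return failures, sorted(addresses)
```

Running `probe_resolution("tooltool.mozilla-releng.net")` inside the container under load would give a failure count to compare against the same probe run from another network.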

:dividehex, :fubar: Any thoughts on whether this is our issue or something related to Bitbar's DNS setup?
Flags: needinfo?(klibby)
Flags: needinfo?(jwatkins)
I think I found the point where the change caused the problems: Bug 1487798 comment 11
See Also: → 1487798
The images in use at Bitbar are Ubuntu 16.04 from 2018-07-03. Might they have obsolete libraries that do not support our current TLS version requirements? We are considering creating new images based on 18.04. Any thoughts?
See Also: → 1501802
(In reply to Bob Clary [:bc:] from comment #10)
> I think I found the point where the change caused the problems: Bug 1487798
> comment 11

Interesting; changes to the SSL certificate shouldn't have any bearing on DNS resolution. Something in the back of my mind makes me wonder whether that error is actually accurate; ISTR seeing something similar ages ago with DXR, I think, where a Python library wasn't being completely honest about what had happened. If that's the case, then your comment in #c11 could be the reason.

If that's NOT it, then I'd be looking at potential network issues before DNS.
Flags: needinfo?(klibby)
We've been looking into networking and DNS resolution issues in Bitbar's network and have found no indication that the problem is there.

The images we use at Bitbar were created 2018-07-03 which is only slightly earlier than our in-tree Docker images which were last updated 2018-07-26.

There were also changes related to herokudns.com in addition to the cert, weren't there?
The redirections are pretty weird.

HTTP/1.1 302 FOUND
Location: http://tooltool.mozilla-releng.net/docs
HTTP/1.1 302 FOUND
Location: https://tooltool.mozilla-releng.net/docs
HTTP/1.1 301 MOVED PERMANENTLY
Location: http://tooltool.mozilla-releng.net/docs/
HTTP/1.1 302 FOUND
Location: https://tooltool.mozilla-releng.net/docs/
HTTP/1.1 200 OK
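
A chain like the one above can be recorded hop by hop with a small tracer. The following is a minimal sketch: `trace_redirects` is a hypothetical helper, and `fetch` is an assumed callable that requests one URL without following redirects and returns the status code and the Location header.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def trace_redirects(start_url, fetch, max_hops=10):
    """Follow redirects one hop at a time and return [(url, status), ...].

    `fetch(url)` must return (status_code, location_header_or_None).
    """
    hops = []
    url = start_url
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            return hops
        url = urljoin(url, location)  # Location may be a relative reference
    raise RuntimeError("too many redirects")
```

Listing the hops makes it easy to see each switch between http and https, and every extra hop is another request that can fail under flaky name resolution.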
Upgrading the image to Ubuntu 18.04 proved to solve the problem.

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=52fc5af1d9328f61981fbbcf9ae3df7dd547fe08&selectedJob=207937869

The g5 errors are due to a problem with those devices that will require an update and release of mozdevice.
The p2 errors are due to mozsystemmonitor not being updated on PyPI.

I'll file bugs on those issues and we should be ready to go later today.
Flags: needinfo?(jwatkins)
Ubuntu 18.04 was not a complete solution. I am seeing this when testing android hardware under load:

<https://treeherder.mozilla.org/logviewer.html#?job_id=208148925&repo=try&lineNumber=1836-1853>. I am not sure of the frequency yet, but it is still the URLError: <urlopen error [Errno -3] Temporary failure in name resolution> error (with the patch from Bug 1501802). If the frequency is too high, I will probably just bake host-utils and minidump_stackwalk into the image.
I added the tooltool.py from https://github.com/mozilla/build-tooltool to the image, along with copies of testing/config/tooltool-manifests/linux64/hostutils.manifest and testing/config/tooltool-manifests/linux64/releng.manifest, and use them to populate /builds/tooltool_cache during image creation. This prevents network access for host-utils or minidump_stackwalk unless the cached copies become out of date.
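Whether a baked-in cache still covers a manifest can be checked with a short script. This is a sketch under stated assumptions, not tooltool.py itself: it assumes the manifest is a JSON list of entries with "filename", "digest" and "algorithm" keys and that cached files are stored under their digest as the file name. `missing_from_cache` is a hypothetical helper name.

```python
import hashlib
import json
import os


def missing_from_cache(manifest_path, cache_dir):
    """Return manifest filenames that are absent or corrupt in cache_dir.

    Assumes cached files are named by their digest (an assumption about
    the cache layout, to be verified against tooltool.py).
    """
    with open(manifest_path) as fh:
        entries = json.load(fh)
    missing = []
    for entry in entries:
        cached = os.path.join(cache_dir, entry["digest"])
        if not os.path.isfile(cached):
            missing.append(entry["filename"])
            continue
        hasher = hashlib.new(entry.get("algorithm", "sha512"))
        with open(cached, "rb") as fh:
            # Hash in 1 MiB chunks so large archives are not read at once.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                hasher.update(chunk)
        if hasher.hexdigest() != entry["digest"]:
            missing.append(entry["filename"])
    return missing
```

An empty result for both manifests would mean a job can be served entirely from /builds/tooltool_cache; any listed file would still require a network fetch.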
This specifies /builds/tooltool_cache for android-hw.
Attachment #9020611 - Flags: review?(jmaher)
These two changes have eliminated the temporary name resolution error since we no longer hit the network for host-utils or linux64-minidump_stackwalk.

<https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=bclary%40mozilla.com&tochange=19852c0d13e2d1d5acf232f08439e64dfac68cc5&fromchange=69bfd93defbfdb0af9b8ed341504c623247a149a>

The first run is with the regular population of devices we have been running with for some time. It initially had an issue because I had left out libcurl3, which the linux64-minidump_stackwalk from tooltool requires. I fixed that during the run.

The second run is with the new motog5 and pixel2 devices.

No URLError: <urlopen error [Errno -3] Temporary failure in name resolution> errors.

The majority of the remaining errors are due to jit failures, which are product issues and not related to test infra.

The others are ADBTimeoutErrors, "raptor-main TEST-UNEXPECTED-FAIL: no raptor test results were found" on motog5, or other intermittent failures.

Once this patch lands we can call this fixed.
Attachment #9020611 - Flags: review?(jmaher) → review+
https://hg.mozilla.org/mozilla-central/rev/6b83a5ae6c26
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
We still see this on beta, where attachment 9020611 is missing. Please check it into beta.
Whiteboard: [stockwell infra] → [stockwell infra][checkin-needed-beta]
https://hg.mozilla.org/releases/mozilla-beta/rev/bf07e6c3ea33
Whiteboard: [stockwell infra][checkin-needed-beta] → [stockwell infra]
Blocks: 1483695