Closed
Bug 1486071
Opened 3 years ago
Closed 2 years ago
errors connecting to the debian package snapshot server
Categories
(Firefox Build System :: Task Configuration, task)
Tracking
(firefox-esr60 fixed, firefox65 fixed)
RESOLVED
FIXED
mozilla65
People
(Reporter: apavel, Assigned: glandium)
References
Details
Attachments
(1 file, 1 obsolete file)
Treeherder link: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=195780959
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=195781915&repo=autoland&lineNumber=207

[task 2018-08-24T17:19:27.272Z] Ign:10 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 zip amd64 3.0-11+b1
[task 2018-08-24T17:19:27.272Z] Err:1 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 libssl1.0.2 amd64 1.0.2l-2
[task 2018-08-24T17:19:27.272Z] Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.273Z] Err:2 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 bzip2 amd64 1.0.6-8.1
[task 2018-08-24T17:19:27.273Z] Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.273Z] Err:3 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 libbsd0 amd64 0.8.3-1
[task 2018-08-24T17:19:27.273Z] Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.273Z] Err:4 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 libncurses5 amd64 6.0+20161126-1
[task 2018-08-24T17:19:27.273Z] Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.273Z] Err:5 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 libedit2 amd64 3.1-20160903-3
[task 2018-08-24T17:19:27.273Z] Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.274Z] Err:6 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 openssh-client amd64 1:7.4p1-10+deb9u1
[task 2018-08-24T17:19:27.274Z] Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.274Z] Err:7 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 libcurl3 amd64 7.52.1-5
[task 2018-08-24T17:19:27.274Z] Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.274Z] Err:8 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 curl amd64 7.52.1-5
[task 2018-08-24T17:19:27.274Z] Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.274Z] Err:9 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 unzip amd64 6.0-21
[task 2018-08-24T17:19:27.274Z] Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.274Z] Err:10 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 zip amd64 3.0-11+b1
[task 2018-08-24T17:19:27.274Z] Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/o/openssl1.0/libssl1.0.2_1.0.2l-2_amd64.deb Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/b/bzip2/bzip2_1.0.6-8.1_amd64.deb Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/libb/libbsd/libbsd0_0.8.3-1_amd64.deb Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/n/ncurses/libncurses5_6.0+20161126-1_amd64.deb Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/libe/libedit/libedit2_3.1-20160903-3_amd64.deb Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/o/openssh/openssh-client_7.4p1-10+deb9u1_amd64.deb Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/c/curl/libcurl3_7.52.1-5_amd64.deb Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/c/curl/curl_7.52.1-5_amd64.deb Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/u/unzip/unzip_6.0-21_amd64.deb Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/z/zip/zip_3.0-11+b1_amd64.deb Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
[task 2018-08-24T17:19:27.361Z] Traceback (most recent call last):
[task 2018-08-24T17:19:27.361Z]   File "/builds/worker/checkouts/gecko/taskcluster/mach_commands.py", line 468, in build_image
[task 2018-08-24T17:19:27.361Z]     build_image(image_name, tag, os.environ)
[task 2018-08-24T17:19:27.361Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/docker.py", line 83, in build_image
[task 2018-08-24T17:19:27.361Z]     docker.post_to_docker(buf.getvalue(), '/build', nocache=1, t=tag)
[task 2018-08-24T17:19:27.361Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/util/docker.py", line 107, in post_to_docker
[task 2018-08-24T17:19:27.362Z]     raise Exception(data['error'])
[task 2018-08-24T17:19:27.362Z] Exception: The command [/bin/sh -c apt-get update && apt-get install bzip2 curl git gzip openssh-client unzip zip && apt-get clean] returned a non-zero code: 100
Comment 1•3 years ago
This is almost certainly an intermittent service availability issue. There is not much we can do except wait and retry. As such, this bug should be resolved once things start working again. Longer term, bug 1461802 tracks weaning us off this third-party service dependency, or at least mitigating our exposure to it.
Comment 5•3 years ago
Comment 6•3 years ago
Tom, do you want to land this? It's causing lots of docker images to fail.
Assignee: nobody → mozilla
Comment 7•3 years ago
I'd be happy to, but I don't have r+.
Comment 8•3 years ago
Greg, can you take a look? Wander, do we do anything special with docker-worker images to enable or disable v6? I think the networking setup is unchanged from the standard base image, right?
Flags: needinfo?(wcosta)
Flags: needinfo?(gps)
Comment 9•3 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #8)
> Greg, can you take a look? Wander, do we do anything special with
> docker-worker images to enable or disable v6? I think the networking setup
> is unchanged from the standard base image, right?

Yes, https://github.com/taskcluster/docker-worker/blob/master/deploy/template/etc/docker/docker.json#L2
Flags: needinfo?(wcosta)
Comment 10•3 years ago
(In reply to Wander Lairson Costa [:wcosta] from comment #9)
> (In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #8)
> > Greg, can you take a look? Wander, do we do anything special with
> > docker-worker images to enable or disable v6? I think the networking setup
> > is unchanged from the standard base image, right?
>
> Yes,
> https://github.com/taskcluster/docker-worker/blob/master/deploy/template/etc/docker/docker.json#L2

I mean, no. This config file is written at the app image.
Assignee | ||
Comment 11•3 years ago
Why are we explicitly enabling ipv6 in docker if it doesn't work on EC2?
Comment 12•3 years ago
From bug 1441557, it looks like there are tests in Firefox that require IPv6 to be enabled at the kernel level, but do not require routability. This is a pretty common arrangement. For example, on my laptop here at Puzzles Cafe (which does not provide IPv6):

dustin@jemison ~ $ ping6 arin.net
connect: Network is unreachable
Updated•3 years ago
Flags: needinfo?(gps)
Comment 13•3 years ago
Retriggers in https://treeherder.mozilla.org/#/jobs?repo=try&revision=c0984659e0a36ff4afbf7973ce7032b8cb5fa28a are looking good.
Assignee | ||
Comment 14•3 years ago
So, since I had both a loaner with access to snapshot.debian.org and another without, I gathered some interesting data:
- Both IPv4 addresses for snapshot.debian.org are tried before IPv6 is tried and the failure happens. They both are unreachable when one is, or both reachable when one is.
- Comparing between both loaners and from my own machine at home, traceroute stops at the last router before both hosts, so it would seem plausible that both hosts have some (synchronized) filtering.
- The public IPv4 of the loaner with access is 54.225.3.154, while the public IPv4 of the one without is 18.212.240.147

I went through all the logs in https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-07-04&endday=2018-11-01&tree=all&page=1&page_size=50&bug=1486071 and apart from the fact that some were, in fact, not related to this bug at all, the ones that did have a problem connecting to snapshot.debian.org had something clearly in common. Here's the list of IPv4s:
18.144.32.96
18.144.63.37
18.206.136.206
18.207.181.113
18.212.146.108
18.212.202.120
18.212.204.30
18.232.99.231
18.236.139.10
18.236.215.218

They're all in 18.128.0.0/9
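The netblock observation above can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the addresses are copied from the comment, while the names `FAILING_IPS`, `SUSPECT_BLOCK` and `in_suspect_block` are made up for illustration and do not come from any patch in this bug.

```python
import ipaddress

# Public IPv4 addresses of workers that failed to reach snapshot.debian.org,
# copied from the comment above.
FAILING_IPS = [
    "18.144.32.96", "18.144.63.37", "18.206.136.206", "18.207.181.113",
    "18.212.146.108", "18.212.202.120", "18.212.204.30", "18.232.99.231",
    "18.236.139.10", "18.236.215.218",
]

# The block all failing addresses appear to share.
SUSPECT_BLOCK = ipaddress.ip_network("18.128.0.0/9")


def in_suspect_block(ip: str) -> bool:
    """Return True if the address falls inside 18.128.0.0/9."""
    return ipaddress.ip_address(ip) in SUSPECT_BLOCK


if __name__ == "__main__":
    # The two loaners from the comment are appended for comparison.
    for ip in FAILING_IPS + ["54.225.3.154", "18.212.240.147"]:
        print(ip, in_suspect_block(ip))
```

Run against the two loaners, the check places 18.212.240.147 (no access) inside the block and 54.225.3.154 (access) outside it, which matches the pattern described.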
Assignee | ||
Comment 15•3 years ago
Opened a bug against snapshot.debian.org: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912524
Assignee | ||
Comment 16•3 years ago
As a mitigation, is it possible to limit the instances we launch in a way that makes them never use that IP block? Other idea: enable ipv6 (because in fact, AWS *does* support ipv6)
Comment 17•3 years ago
96.32.144.18.in-addr.arpa domain name pointer ec2-18-144-32-96.us-west-1.compute.amazonaws.com.
37.63.144.18.in-addr.arpa domain name pointer ec2-18-144-63-37.us-west-1.compute.amazonaws.com.
{ "ip_prefix": "18.144.0.0/15", "region": "us-west-1", "service": "EC2" },

10.139.236.18.in-addr.arpa domain name pointer ec2-18-236-139-10.us-west-2.compute.amazonaws.com.
{ "ip_prefix": "18.236.0.0/15", "region": "us-west-2", "service": "EC2" },

206.136.206.18.in-addr.arpa domain name pointer ec2-18-206-136-206.compute-1.amazonaws.com.
113.181.207.18.in-addr.arpa domain name pointer ec2-18-207-181-113.compute-1.amazonaws.com.
{ "ip_prefix": "18.204.0.0/14", "region": "us-east-1", "service": "EC2" },

108.146.212.18.in-addr.arpa domain name pointer ec2-18-212-146-108.compute-1.amazonaws.com.
120.202.212.18.in-addr.arpa domain name pointer ec2-18-212-202-120.compute-1.amazonaws.com.
30.204.212.18.in-addr.arpa domain name pointer ec2-18-212-204-30.compute-1.amazonaws.com.
{ "ip_prefix": "18.208.0.0/13", "region": "us-east-1", "service": "EC2" },

231.99.232.18.in-addr.arpa domain name pointer ec2-18-232-99-231.compute-1.amazonaws.com.
{ "ip_prefix": "18.232.0.0/14", "region": "us-east-1", "service": "EC2" },

so, it doesn't seem like that block is specific to a region (apparently `compute-1` is the original region, us-east-1). I would guess that eu-central-1 has IPs in that block, too -- we're just not running instances there to see them.

Also, per https://bgpview.io/prefix/18.128.0.0/9 note that this netblock has only been registered for four months and one day. So an outdated bogon list (i.e., someone not using https://www.team-cymru.com/bogon-reference-bgp.html) would blackhole this whole netblock.
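The prefix-to-region mapping above can be reproduced with a short lookup. This is only a sketch: the table holds just the five prefixes quoted in the comment, not the full published AWS range list, and `AWS_PREFIXES` and `region_for` are hypothetical names.

```python
import ipaddress

# Only the EC2 prefixes quoted in the comment above; not the complete AWS list.
AWS_PREFIXES = {
    "18.144.0.0/15": "us-west-1",
    "18.236.0.0/15": "us-west-2",
    "18.204.0.0/14": "us-east-1",
    "18.208.0.0/13": "us-east-1",
    "18.232.0.0/14": "us-east-1",
}


def region_for(ip):
    """Return the region whose quoted prefix contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for prefix, region in AWS_PREFIXES.items():
        if addr in ipaddress.ip_network(prefix):
            return region
    return None


if __name__ == "__main__":
    for ip in ["18.144.32.96", "18.236.139.10", "18.212.146.108"]:
        print(ip, region_for(ip))
```

Every prefix in the table is itself a subnet of 18.128.0.0/9, which is consistent with the conclusion that the affected block spans several regions.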
Comment 18•3 years ago
Right, and the key detail that makes comment 17 answer your question is that the only way we could control public IPs is by selecting regions.
Updated•3 years ago
Attachment #9013479 -
Attachment is obsolete: true
Assignee | ||
Comment 20•3 years ago
So here's a random idea for a "workaround", until we can figure out something more long term: if apt fails, get the public IP of the current worker (http://169.254.169.254/latest/meta-data/public-ipv4 ?), and if it's in the block 18.128.0.0/9, exit the job with a specific error number that we configure as a retry-exit-status. My worry here is that the taskcluster documentation for retries suggests that workers retry the task themselves, which would be pointless in this case, where we want to use a different worker.
Comment 21•3 years ago
The retries are queue-level retries (it resolves the run with reason 'intermittent-task'). So it will go back in the queue and be picked up by whatever worker claims it. It sounds like a pretty good workaround to me! I can fix up that doc if you point me to it.
Assignee | ||
Comment 22•3 years ago
https://docs.taskcluster.net/docs/reference/platform/taskcluster-queue/docs/worker-interaction#intermittent-tasks

Second paragraph:
> Reporting a task exception with reason intermittent-task will retry the task if retries haven't been exhausted. It is strongly encouraged that workers retry the task/run it already holds, rather than resolving the task and have the queue retry the task.
Assignee | ||
Comment 23•3 years ago
Is there something better than http://169.254.169.254/latest/meta-data/public-ipv4 to get the public ip?
Comment 24•3 years ago
(In reply to Mike Hommey [:glandium] from comment #23)
> Is there something better than
> http://169.254.169.254/latest/meta-data/public-ipv4 to get the public ip?

I think that's adequate for the moment and more reliable than any external service.

https://github.com/taskcluster/taskcluster-queue/pull/298 for the docs.
Comment 25•3 years ago
Baking in the assumption that tasks run in EC2 scares me a little. Especially since GCP is apparently in our future. And making a network request as part of the task also scares me: network requests are intrinsically unreliable. The logs for tasks already log the public IP address of the worker. Could we get that IP string exposed via an environment variable so we can avoid the IP address lookup? Or perhaps it already is?
Assignee | ||
Comment 26•3 years ago
Note that 169.254.169.254 is not on a real network.
Comment 27•3 years ago
I think this would be a short-term workaround. If snapshot permanently isn't available from the whole Internet, then I think we need to stop using it. The IP in question here is local to the host (it's the EC2 metadata service).
Assignee | ||
Comment 28•3 years ago
We clearly need some cache (bug 1461802)
Assignee | ||
Comment 29•3 years ago
That said, I just checked on a loaner, and the IP is available in the environment: TASKCLUSTER_PUBLIC_IP
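With the IP available from `TASKCLUSTER_PUBLIC_IP`, the workaround floated in comment 20 needs no network request at all. The sketch below shows one way it could look; it is not the patch that landed. `RETRY_EXIT_STATUS` (75) is an arbitrary placeholder for whatever status would be configured as a retry-exit-status, and `choose_exit_status` is a hypothetical helper.

```python
import ipaddress
import os

# The block whose workers could not reach snapshot.debian.org.
BLOCKED = ipaddress.ip_network("18.128.0.0/9")

# Hypothetical value: stands in for a status configured as retry-exit-status.
RETRY_EXIT_STATUS = 75


def choose_exit_status(apt_status, public_ip):
    """Map an apt-get exit status to the status the task should exit with.

    A failure on a worker inside the blocked netblock is turned into the
    retry status so that the queue can hand the task to another worker.
    Any other outcome is passed through unchanged.
    """
    if apt_status == 0:
        return 0
    if public_ip:
        try:
            if ipaddress.ip_address(public_ip) in BLOCKED:
                return RETRY_EXIT_STATUS
        except ValueError:
            # Unparseable address: fall through and report the real failure.
            pass
    return apt_status


if __name__ == "__main__":
    ip = os.environ.get("TASKCLUSTER_PUBLIC_IP")
    print(choose_exit_status(100, ip))
```

Passing the IP in as an argument keeps the decision testable outside a worker, and a missing or malformed variable simply reports the original apt-get failure.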
Assignee | ||
Comment 30•3 years ago
When apt-get fails, it has a distinctive error code (100). Most of the time, when apt-get fails, it's because of some network error, or possibly some problem unpacking archives. When that happens, retrying the task usually "fixes" the issue.

One of the (currently) most common causes of problems is snapshot.debian.org not being available to some of the EC2 instances. It would be possible to set things up so that we retry only when we detect such a setup (by checking whether the public IP of the instance is in the known list of problematic IPs), but that would require possibly wrapping apt-get, or something along those lines, which is not entirely trivial to do for the packages tasks, because they don't rely on docker images.

However, since there aren't many apt-get failures other than these, and since there have been, historically, some intermittent apt-get failures of a different nature that were solved by re-running the tasks, it seems fair to just retry whenever apt-get fails.

One downside of the approach is that if for some reason a change to a Dockerfile ends up mentioning a package that doesn't exist, that too will result in multiple retries, which might be inconvenient, but that's not something that's going to happen often.
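The "retry whenever apt-get fails" approach comes down to listing exit status 100 among the statuses that trigger a retry. The snippet below is a sketch under assumptions: the `retry-exit-status` key name is taken from comment 20, but the shape of the `worker` dictionary and the `add_apt_retry` helper are illustrative and are not copied from the landed changeset.

```python
# Exit status apt-get reports on failure, per the comment above.
APT_GET_FAILURE = 100


def add_apt_retry(worker):
    """Return a copy of a worker definition that retries on apt-get failure.

    Existing retry statuses are preserved and the input is left untouched.
    """
    updated = dict(worker)
    statuses = set(updated.get("retry-exit-status", []))
    statuses.add(APT_GET_FAILURE)
    updated["retry-exit-status"] = sorted(statuses)
    return updated


if __name__ == "__main__":
    print(add_apt_retry({"max-run-time": 3600}))
```

Because the status is matched on its own, a Dockerfile that names a nonexistent package would be retried as well, which is the downside acknowledged above.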
Assignee | ||
Updated•3 years ago
Assignee: mozilla → mh+mozilla
Assignee | ||
Comment 31•3 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=057bf1130e46e1f65e4320f1445b415f6b0094d4
Comment 36•2 years ago
Pushed by mh@glandium.org: https://hg.mozilla.org/integration/autoland/rev/7c0c8f3df698 Retry docker-image and packages tasks that fail during apt-get. r=dustin
Comment 37•2 years ago
bugherder |
https://hg.mozilla.org/mozilla-central/rev/7c0c8f3df698
Status: NEW → RESOLVED
Closed: 2 years ago
status-firefox65:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Comment 38•2 years ago
bugherderuplift |
https://hg.mozilla.org/releases/mozilla-esr60/rev/20f90f3014dd
status-firefox-esr60:
--- → fixed