Closed Bug 1486071 Opened 3 years ago Closed 2 years ago

errors connecting to the debian package snapshot server

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set
normal

Tracking

(firefox-esr60 fixed, firefox65 fixed)

RESOLVED FIXED
mozilla65

People

(Reporter: apavel, Assigned: glandium)

Attachments

(1 file, 1 obsolete file)

Treeherder link: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=195780959

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=195781915&repo=autoland&lineNumber=207

[task 2018-08-24T17:19:27.272Z] Ign:10 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 zip amd64 3.0-11+b1
[task 2018-08-24T17:19:27.272Z] Err:1 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 libssl1.0.2 amd64 1.0.2l-2
[task 2018-08-24T17:19:27.272Z]   Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.273Z] Err:2 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 bzip2 amd64 1.0.6-8.1
[task 2018-08-24T17:19:27.273Z]   Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.273Z] Err:3 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 libbsd0 amd64 0.8.3-1
[task 2018-08-24T17:19:27.273Z]   Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.273Z] Err:4 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 libncurses5 amd64 6.0+20161126-1
[task 2018-08-24T17:19:27.273Z]   Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.273Z] Err:5 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 libedit2 amd64 3.1-20160903-3
[task 2018-08-24T17:19:27.273Z]   Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.274Z] Err:6 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 openssh-client amd64 1:7.4p1-10+deb9u1
[task 2018-08-24T17:19:27.274Z]   Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.274Z] Err:7 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 libcurl3 amd64 7.52.1-5
[task 2018-08-24T17:19:27.274Z]   Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.274Z] Err:8 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 curl amd64 7.52.1-5
[task 2018-08-24T17:19:27.274Z]   Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.274Z] Err:9 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 unzip amd64 6.0-21
[task 2018-08-24T17:19:27.274Z]   Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.274Z] Err:10 http://snapshot.debian.org/archive/debian/20170830T000511Z stretch/main amd64 zip amd64 3.0-11+b1
[task 2018-08-24T17:19:27.274Z]   Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/o/openssl1.0/libssl1.0.2_1.0.2l-2_amd64.deb  Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/b/bzip2/bzip2_1.0.6-8.1_amd64.deb  Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/libb/libbsd/libbsd0_0.8.3-1_amd64.deb  Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/n/ncurses/libncurses5_6.0+20161126-1_amd64.deb  Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/libe/libedit/libedit2_3.1-20160903-3_amd64.deb  Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/o/openssh/openssh-client_7.4p1-10+deb9u1_amd64.deb  Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/c/curl/libcurl3_7.52.1-5_amd64.deb  Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/c/curl/curl_7.52.1-5_amd64.deb  Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/u/unzip/unzip_6.0-21_amd64.deb  Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Failed to fetch http://snapshot.debian.org/archive/debian/20170830T000511Z/pool/main/z/zip/zip_3.0-11+b1_amd64.deb  Cannot initiate the connection to snapshot.debian.org:80 (2001:1af8:4020:b030:deb::185). - connect (101: Network is unreachable) [IP: 2001:1af8:4020:b030:deb::185 80]
[task 2018-08-24T17:19:27.276Z] E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
[task 2018-08-24T17:19:27.361Z] Traceback (most recent call last):
[task 2018-08-24T17:19:27.361Z]   File "/builds/worker/checkouts/gecko/taskcluster/mach_commands.py", line 468, in build_image
[task 2018-08-24T17:19:27.361Z]     build_image(image_name, tag, os.environ)
[task 2018-08-24T17:19:27.361Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/docker.py", line 83, in build_image
[task 2018-08-24T17:19:27.361Z]     docker.post_to_docker(buf.getvalue(), '/build', nocache=1, t=tag)
[task 2018-08-24T17:19:27.361Z]   File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/util/docker.py", line 107, in post_to_docker
[task 2018-08-24T17:19:27.362Z]     raise Exception(data['error'])
[task 2018-08-24T17:19:27.362Z] Exception: The command [/bin/sh -c apt-get update &&     apt-get install       bzip2       curl       git       gzip       openssh-client       unzip       zip &&     apt-get clean] returned a non-zero code: 100
This is almost certainly an intermittent service availability issue. There is not much we can do except wait and retry. As such, this bug should be resolved once things start working again.

Longer term, bug 1461802 tracks weaning off this third-party service dependency or at least mitigating our exposure to it.
Tom, do you want to land this?  It's causing lots of docker images to fail.
Assignee: nobody → mozilla
I'd be happy to, but I don't have r+.
Greg, can you take a look?  Wander, do we do anything special with docker-worker images to enable or disable v6?  I think the networking setup is unchanged from the standard base image, right?
Flags: needinfo?(wcosta)
Flags: needinfo?(gps)
No longer blocks: 1432390
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #8)
> Greg, can you take a look?  Wander, do we do anything special with
> docker-worker images to enable or disable v6?  I think the networking setup
> is unchanged from the standard base image, right?

Yes, https://github.com/taskcluster/docker-worker/blob/master/deploy/template/etc/docker/docker.json#L2
Flags: needinfo?(wcosta)
(In reply to Wander Lairson Costa [:wcosta] from comment #9)
> (In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #8)
> > Greg, can you take a look?  Wander, do we do anything special with
> > docker-worker images to enable or disable v6?  I think the networking setup
> > is unchanged from the standard base image, right?
> 
> Yes,
> https://github.com/taskcluster/docker-worker/blob/master/deploy/template/etc/
> docker/docker.json#L2

I mean, no. This config file is written at the app image.
Why are we explicitly enabling ipv6 in docker if it doesn't work on EC2?
From bug 1441557, it looks like there are tests in Firefox that require that it be enabled at the kernel level, but do not require routability.

This is a pretty common arrangement.  For example on my laptop here at Puzzles Cafe (which does not provide IPv6):

dustin@jemison ~ $ ping6 arin.net
connect: Network is unreachable
Flags: needinfo?(gps)
So, since I had both a loaner with access to snapshot.debian.org and another without, I gathered some interesting data:
- Both IPv4 addresses for snapshot.debian.org are tried before IPv6 is attempted, and only then is the failure reported. The two IPv4 addresses are either both reachable or both unreachable.
- Comparing between both loaners and from my own machine at home, traceroute stops at the last router before both hosts, so it would seem plausible that both hosts have some (synchronized) filtering.
- The public IPv4 of the loaner with access is 54.225.3.154, while the public IPv4 of the one without is 18.212.240.147

I went through all the logs in https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-07-04&endday=2018-11-01&tree=all&page=1&page_size=50&bug=1486071 and apart from the fact that some were, in fact, not related to this bug at all, the ones that did have a problem connecting to snapshot.debian.org had something clearly in common. Here's the list of IPv4s:

18.144.32.96
18.144.63.37
18.206.136.206
18.207.181.113
18.212.146.108
18.212.202.120
18.212.204.30
18.232.99.231
18.236.139.10
18.236.215.218

They're all in 18.128.0.0/9
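
The claim above can be checked offline with Python's standard `ipaddress` module. This is only a minimal sketch: the address list and the netblock are copied from this comment, while the helper name `in_suspect_block` is made up for illustration.

```python
import ipaddress

# Netblock named in the comment above.
SUSPECT_BLOCK = ipaddress.ip_network("18.128.0.0/9")

# Public IPv4 addresses of workers that failed to reach snapshot.debian.org,
# copied from the list above.
FAILING_WORKER_IPS = [
    "18.144.32.96",
    "18.144.63.37",
    "18.206.136.206",
    "18.207.181.113",
    "18.212.146.108",
    "18.212.202.120",
    "18.212.204.30",
    "18.232.99.231",
    "18.236.139.10",
    "18.236.215.218",
]


def in_suspect_block(ip: str) -> bool:
    """Return True if the given IPv4 address lies inside 18.128.0.0/9."""
    return ipaddress.ip_address(ip) in SUSPECT_BLOCK


all_in_block = all(in_suspect_block(ip) for ip in FAILING_WORKER_IPS)
print("all failing workers in block:", all_in_block)

# The two loaners mentioned earlier in this comment.
print("loaner without access in block:", in_suspect_block("18.212.240.147"))
print("loaner with access in block:", in_suspect_block("54.225.3.154"))
```

The two loaner addresses fit the same pattern: the one without access falls inside the block and the one with access does not.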
Opened a bug against snapshot.debian.org:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=912524
As a mitigation, is it possible to limit the instances we launch in a way that makes them never use that IP block? Other idea: enable ipv6 (because in fact, AWS *does* support ipv6)
96.32.144.18.in-addr.arpa domain name pointer ec2-18-144-32-96.us-west-1.compute.amazonaws.com.
37.63.144.18.in-addr.arpa domain name pointer ec2-18-144-63-37.us-west-1.compute.amazonaws.com.
    {
      "ip_prefix": "18.144.0.0/15",
      "region": "us-west-1",
      "service": "EC2"
    },

10.139.236.18.in-addr.arpa domain name pointer ec2-18-236-139-10.us-west-2.compute.amazonaws.com.
    {
      "ip_prefix": "18.236.0.0/15",
      "region": "us-west-2",
      "service": "EC2"
    },

206.136.206.18.in-addr.arpa domain name pointer ec2-18-206-136-206.compute-1.amazonaws.com.
113.181.207.18.in-addr.arpa domain name pointer ec2-18-207-181-113.compute-1.amazonaws.com.
    {
      "ip_prefix": "18.204.0.0/14",
      "region": "us-east-1",
      "service": "EC2"
    },

108.146.212.18.in-addr.arpa domain name pointer ec2-18-212-146-108.compute-1.amazonaws.com.
120.202.212.18.in-addr.arpa domain name pointer ec2-18-212-202-120.compute-1.amazonaws.com.
30.204.212.18.in-addr.arpa domain name pointer ec2-18-212-204-30.compute-1.amazonaws.com.
    {
      "ip_prefix": "18.208.0.0/13",
      "region": "us-east-1",
      "service": "EC2"
    },

231.99.232.18.in-addr.arpa domain name pointer ec2-18-232-99-231.compute-1.amazonaws.com.
    {
      "ip_prefix": "18.232.0.0/14",
      "region": "us-east-1",
      "service": "EC2"
    },

so, it doesn't seem like that block is specific to a region (apparently `compute-1` is the original region, us-east-1).  I would guess that eu-central-1 has IPs in that block, too -- we're just not running instances there to see them.

Also, per https://bgpview.io/prefix/18.128.0.0/9 note that this netblock has only been registered for four months and one day.  So an outdated bogon list (i.e., someone not using https://www.team-cymru.com/bogon-reference-bgp.html) would blackhole this whole netblock.
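
For reference, the prefix lookups above can be reproduced with a short script. It is a sketch that only uses the five prefix entries quoted in this comment rather than the full AWS list; the function name `lookup` is hypothetical.

```python
import ipaddress

SUSPECT_BLOCK = ipaddress.ip_network("18.128.0.0/9")

# Only the entries quoted in the comment above, not the complete AWS list.
AWS_PREFIXES = [
    {"ip_prefix": "18.144.0.0/15", "region": "us-west-1", "service": "EC2"},
    {"ip_prefix": "18.236.0.0/15", "region": "us-west-2", "service": "EC2"},
    {"ip_prefix": "18.204.0.0/14", "region": "us-east-1", "service": "EC2"},
    {"ip_prefix": "18.208.0.0/13", "region": "us-east-1", "service": "EC2"},
    {"ip_prefix": "18.232.0.0/14", "region": "us-east-1", "service": "EC2"},
]


def lookup(ip: str):
    """Return the most specific quoted prefix entry containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [
        entry for entry in AWS_PREFIXES
        if addr in ipaddress.ip_network(entry["ip_prefix"])
    ]
    if not matches:
        return None
    return max(
        matches,
        key=lambda entry: ipaddress.ip_network(entry["ip_prefix"]).prefixlen,
    )


# Regions that own at least one quoted prefix inside the suspect block.
regions_in_block = sorted({
    entry["region"] for entry in AWS_PREFIXES
    if ipaddress.ip_network(entry["ip_prefix"]).subnet_of(SUSPECT_BLOCK)
})

print(lookup("18.144.32.96"))
print(regions_in_block)
```

Every quoted prefix is a subnet of 18.128.0.0/9, and together they span three regions, which matches the observation that the block is not tied to one region.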
Right, and the key detail to make comment 17 answer your question is, the only way we could control public IPs is to select regions.
Attachment #9013479 - Attachment is obsolete: true
So here's a random idea for a "workaround", until we can figure out something more long term:
If apt fails, get the public IP of the current worker (http://169.254.169.254/latest/meta-data/public-ipv4 ?), and if it's in the block 18.128.0.0/9, we exit the job with a specific error number that we configure as a retry-exit-status. My worry here is that the taskcluster documentation for retries suggests that workers retry the task themselves, which would be pointless in this case, where we want to use a different worker.
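
A rough sketch of the decision logic described above, kept free of network access so it can be tested in isolation. The apt-get failure code 100 comes from the failure log in this bug; `RETRY_EXIT_STATUS` and the function name are hypothetical, and the caller is assumed to have obtained the public IP already (for example from the metadata URL mentioned above).

```python
import ipaddress

PROBLEM_BLOCK = ipaddress.ip_network("18.128.0.0/9")
APT_FAILURE = 100       # exit code apt-get returned in the failure log
RETRY_EXIT_STATUS = 4   # hypothetical value; would be listed as a retry-exit-status


def exit_status_for(apt_returncode: int, public_ip: str) -> int:
    """Map an apt-get result to the exit status the task should report.

    Only an apt-get failure on a worker whose public IP is inside the
    problematic block is turned into the retry status; every other
    result is passed through unchanged.
    """
    if apt_returncode != APT_FAILURE:
        return apt_returncode
    try:
        addr = ipaddress.ip_address(public_ip)
    except ValueError:
        # Unknown or malformed IP: do not guess, keep the original failure.
        return apt_returncode
    if addr in PROBLEM_BLOCK:
        return RETRY_EXIT_STATUS
    return apt_returncode


print(exit_status_for(100, "18.212.240.147"))
print(exit_status_for(100, "54.225.3.154"))
```

Passing the IP in as an argument keeps the lookup method (metadata service, environment, or anything else) out of the decision itself.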
The retries are queue-level retries (it resolves the run with reason 'intermittent-task').  So it will go back in the queue and be picked up by whatever worker claims it.

It sounds like a pretty good workaround to me!

I can fix up that doc if you point me to it.
https://docs.taskcluster.net/docs/reference/platform/taskcluster-queue/docs/worker-interaction#intermittent-tasks
Second paragraph:

   Reporting a task exception with reason intermittent-task will retry the task
   if retries haven't been exhausted. It is strongly encouraged that workers
   retry the task/run it already holds, rather than resolving the task and have
   the queue retry the task.
Is there something better than http://169.254.169.254/latest/meta-data/public-ipv4 to get the public ip?
(In reply to Mike Hommey [:glandium] from comment #23)
> Is there something better than
> http://169.254.169.254/latest/meta-data/public-ipv4 to get the public ip?

I think that's adequate for the moment and more reliable than any external service.

https://github.com/taskcluster/taskcluster-queue/pull/298 for the docs.
Baking in the assumption that tasks run in EC2 scares me a little. Especially since GCP is apparently in our future. And making a network request as part of the task also scares me: network requests are intrinsically unreliable.

The logs for tasks already log the public IP address of the worker. Could we get that IP string exposed via an environment variable so we can avoid the IP address lookup? Or perhaps it already is?
Note that 169.254.169.254 is not on a real network.
I think this would be a short-term workaround.  If snapshot permanently isn't available from the whole Internet, then I think we need to stop using it.  The IP in question here is local to the host (it's the EC2 metadata service).
We clearly need some cache (bug 1461802)
That said, I just checked on a loaner, and the IP is available in the environment: TASKCLUSTER_PUBLIC_IP
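
Reading that variable instead of querying the metadata service could look like the sketch below. Only the variable name `TASKCLUSTER_PUBLIC_IP` comes from this bug; the helper name is hypothetical, and it is an assumption that the value is a plain IP address string.

```python
import ipaddress
import os


def worker_public_ip(environ=os.environ):
    """Return the worker's public IP from the environment, or None.

    Assumes TASKCLUSTER_PUBLIC_IP, when set, holds a bare IP address.
    A missing, empty, or malformed value yields None rather than an error,
    so callers can fall back to treating the worker as unknown.
    """
    raw = environ.get("TASKCLUSTER_PUBLIC_IP", "").strip()
    if not raw:
        return None
    try:
        return ipaddress.ip_address(raw)
    except ValueError:
        return None


ip = worker_public_ip()
if ip is None:
    print("public IP not available in the environment")
else:
    print("public IP:", ip, "in 18.128.0.0/9:",
          ip in ipaddress.ip_network("18.128.0.0/9"))
```

This avoids both the network request and the EC2-specific endpoint raised as concerns earlier in the thread.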
When apt-get fails, it has a distinctive error code (100). Most of the
time, when apt-get fails, it's because of some network error, or
possibly some problem unpacking archives. When that happens, retrying
the task usually "fixes" the issue.

One of the (currently) most common causes of problems is
snapshot.debian.org not being available to some of the EC2 instances.

It would be possible to set things up so that we only retry when we
detect such a setup (checking whether the public IP of the instance is
in the known range of problematic IPs), but that would require possibly
wrapping apt-get, or something along those lines, which is not entirely
trivial to do for the packages tasks, because they don't rely on docker
images.

However, since there aren't many apt-get failures other than these,
and since there have been, historically, some intermittent apt-get
failures of a different nature that were solved by re-running the tasks,
it seems fair to just retry whenever apt-get fails.

One downside of the approach is that if for some reason a change to a
Dockerfile ends up mentioning a package that doesn't exist, that too
will result in multiple retries, which might be inconvenient, but
that's not something that's going to happen often.
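
The trade-off described above can be illustrated with a small simulation. This is a toy model, not the queue's actual implementation: `run_task`, `max_retries`, and its default value are hypothetical, and each call to `attempt_fn` stands in for one run of the task on some worker.

```python
APT_FAILURE = 100  # exit code apt-get returns on failure, per the message above


def run_task(attempt_fn, retry_exit_statuses=(APT_FAILURE,), max_retries=5):
    """Run attempt_fn until it stops returning a retryable status.

    Returns (final_status, number_of_runs). At most max_retries retries
    follow the first run, so a permanent failure costs max_retries + 1 runs.
    """
    runs = 0
    while True:
        runs += 1
        status = attempt_fn()
        if status not in retry_exit_statuses or runs > max_retries:
            return status, runs


# Transient case: two workers cannot reach the mirror, the third can.
transient = iter([APT_FAILURE, APT_FAILURE, 0])
print(run_task(lambda: next(transient)))

# Permanent case: a Dockerfile names a package that does not exist,
# so every run fails the same way until retries are exhausted.
print(run_task(lambda: APT_FAILURE))
```

The transient failure recovers after a few runs, while the permanent one burns every retry before the error surfaces, which is the inconvenience the message accepts.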
Assignee: mozilla → mh+mozilla
Duplicate of this bug: 1506353
Duplicate of this bug: 1506354
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/7c0c8f3df698
Retry docker-image and packages tasks that fail during apt-get. r=dustin
https://hg.mozilla.org/mozilla-central/rev/7c0c8f3df698
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Blocks: 1532893