Closed Bug 1372684 Opened 7 years ago Closed 7 years ago

Vagrant up/provision fails with apt-get DNS errors ("Temporary failure resolving 'us.archive.ubuntu.com'")

Categories

(Tree Management :: Treeherder, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: seban, Assigned: emorley)

Attachments

(1 file, 1 obsolete file)

There are four issues, which I faced during the installation:

1) apt-get update fails to fetch files, with a “Temporary failure resolving …” error
Solution: Temporarily add a known DNS server to your system. Here 8.8.8.8 is Google's DNS server.
.. SSH into vagrant, execute this command:
.. `echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf > /dev/null`
.. exit SSH, reload vagrant

2) sudo add-apt-repository -y ppa:ondrej/mysql-5.6 failed.
Error Message: Cannot add PPA: 'ppa:~ondrej...                       
               ERROR: '~ondrej' user or team does not exist.
Solution: 
.. SSH into vagrant, execute this command: 
.. `sudo apt-get install --reinstall ca-certificates`
.. exit SSH, vagrant provision

3) Error Message: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY <key>
Solution: Manually add the gpg public key using `sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <key>`

4) On https://treeherder.readthedocs.io/installation.html#starting-a-local-treeherder-instance, Building the UI should be mentioned before starting of the server.

PS: And also update vagrant and virtualbox to the latest versions as recommended.
Thank you for filing - I'm always keen to improve the UX/docs for getting set up.

I've just tried a `vagrant destroy -f && vagrant up` locally and don't see any of those errors.

Did your investigation find any possible reason as to why you saw them? Old Vagrant/virtualbox version perhaps? Are you using a proxy server on the host? How reliable is your connection/do you ever have issues connecting to other sites/services?

(In reply to Sebastin Santy [:seban] from comment #0)
> 4) On https://treeherder.readthedocs.io/installation.html#starting-a-local-treeherder-instance
> Building the UI should be mentioned before starting of the server.

The docs already have the steps in that order. ie: they say to start runserver, then run `yarn run start:local`, then visit the page in the browser. Is there a way you think this can be tweaked to be clearer?

> PS: And also update vagrant and virtualbox to the latest versions as recommended.

At the top of the docs it says "Install Git, Virtualbox and Vagrant (latest versions recommended).".
Flags: needinfo?(sebastinssanty)
(In reply to Ed Morley [:emorley] from comment #1)

> Did your investigation find any possible reason as to why you saw them? Old
> Vagrant/virtualbox version perhaps? 

Virtualbox 5.1.22 and Vagrant 1.9.5. I guess, they are the latest.

>Are you using a proxy server on the
> host? How reliable is your connection/do you ever have issues connecting to
> other sites/services?

I might be behind a proxy server, but then I have tried on multiple(3) ISPs and they all were having the same issues. Other sites/services including the downloads in vagrant itself were working fine for me.
 

> > 4) On https://treeherder.readthedocs.io/installation.html#starting-a-local-treeherder-instance
> > Building the UI should be mentioned before starting of the server.
> 
> The docs already have the steps in that order. ie: they say to start
> runserver, then run `yarn run start:local`, then visit the page in the
> browser. Is there a way you think this can be tweaked to be clearer?
> 

A note can be put which points out, that without building the UI, `/` route will show a 404 error. This is especially because the earlier version (of treeherder) didn't have the need to build the UI (AFAIK) and hence this point can be missed by oversight.

> > PS: And also update vagrant and virtualbox to the latest versions as recommended.
> 
> At the top of the docs it says "Install Git, Virtualbox and Vagrant (latest
> versions recommended).".

That is what I meant :)
Flags: needinfo?(sebastinssanty)
(In reply to Sebastin Santy [:seban] from comment #2)
> Virtualbox 5.1.22 and Vagrant 1.9.5. I guess, they are the latest.

Were these the versions that showed the error, or the version you are using now?

> I might be behind a proxy server, but then I have tried on multiple(3) ISPs
> and they all were having the same issues. Other sites/services including the
> downloads in vagrant itself were working fine for me.

Ok to be clearer: do you have proxy settings defined in your host OS? (eg browser/OS/bash environment etc)

> > The docs already have the steps in that order. ie: they say to start
> > runserver, then run `yarn run start:local`, then visit the page in the
> > browser. Is there a way you think this can be tweaked to be clearer?
> > 
> 
> A note can be put which points out, that without building the UI, `/` route
> will show a 404 error. This is especially because the earlier version (of
> treeherder) didn't have the need to build the UI (AFAIK) and hence this
> point can be missed by oversight.

If someone is going to miss the steps, won't they also miss the note about 404? hehe

I've just remembered the workflow for this part is going to change soon in bug 1363722, and for the new workflow missing a step results in a timeout (since no server is running) rather than the default Django page, so I think perhaps that will reduce the potential for confusion.

> > At the top of the docs it says "Install Git, Virtualbox and Vagrant (latest
> > versions recommended).".
> 
> That is what I meant :)

Great :-)
Flags: needinfo?(sebastinssanty)
(In reply to Ed Morley [:emorley] from comment #3)

> > Virtualbox 5.1.22 and Vagrant 1.9.5. I guess, they are the latest.
> 
> Were these the versions that showed the error, or the version you are using
> now?

These are the versions I am using now, and which showed the error too.

> > I might be behind a proxy server, but then I have tried on multiple(3) ISPs
> > and they all were having the same issues. Other sites/services including the
> > downloads in vagrant itself were working fine for me.
> 
> Ok to be clearer: do you have proxy settings defined in your host OS? (eg
> browser/OS/bash environment etc)

Using macOS. I don't have any proxy settings defined on it (AFAIK from my network preferences).

> > > The docs already have the steps in that order. ie: they say to start
> > > runserver, then run `yarn run start:local`, then visit the page in the
> > > browser. Is there a way you think this can be tweaked to be clearer?
> > > 
> > 
> > A note can be put which points out, that without building the UI, `/` route
> > will show a 404 error. This is especially because the earlier version (of
> > treeherder) didn't have the need to build the UI (AFAIK) and hence this
> > point can be missed by oversight.
> 
> If someone is going to miss the steps, won't they also miss the note about
> 404? hehe
> 
> I've just remembered the workflow for this part is going to change soon in
> bug 1363722, and for the new workflow missing a step results in a timeout
> (since no server is running) rather than the default Django page, so I think
> perhaps that will reduce the potential for confusion.

Great! That bug exactly describes what I meant. That would surely reduce the confusion.
Flags: needinfo?(sebastinssanty)
Could you run this on both the host and inside the vagrant environment and paste the result?
$ env | egrep 'LC_|LANG'

(I'm wondering if you have a non-standard locale set on the host which transfers to the guest, which then triggers https://github.com/oerdnj/deb.sury.org/issues/56)
Flags: needinfo?(sebastinssanty)
On my host:

LC_CTYPE=UTF-8

Inside vagrant:

LANG=en_US.UTF-8
LANGUAGE=en_US:
LC_CTYPE=UTF-8

Oh, I wasn't aware of such a problem.
Flags: needinfo?(sebastinssanty)
Hmm that looks relatively normal. For comparison I get this on the host:
LANG=en_US.UTF-8

...and this on the guest:
LANG=en_US.UTF-8
LANGUAGE=en_US:

I've just updated to the same vagrant version as you (I was already on the same virtualbox version), and tried another vagrant destroy/up which worked fine again. 

Could I just check which version of the `bento/ubuntu-16.04` box you are using? I'm on the latest, which is 2.3.5 (use `vagrant box list` to see installed boxes).

Other than that, it would be really helpful if you could try doing a `vagrant destroy -f && vagrant up` to see if it reproduces, and if so, pastebin-ing the entire log so we can try to file upstream bugs (I'd much prefer to have the issues fixed than just documented).
(In reply to Ed Morley [:emorley] from comment #7)

> Could I just check which version of the `bento/ubuntu-16.04` box you are
> using? I'm on the latest, which is 2.3.5 (use `vagrant box list` to see
> installed boxes).

I am also on the latest, which is 2.3.5

> Other than that, it would be really helpful if you could try doing a
> `vagrant destroy -f && vagrant up` to see if it reproduces, and if so,
> pastebin-ing the entire log so we can try to file upstream bugs (I'd much
> prefer to have the issues fixed than just documented).

Going by the numbering in the first comment, and pasting logs accordingly:

1) https://sebastin.pastebin.mozilla.org/9024565

A small error (possibly a consequence of the above) also occurred, and was solved by a simple restart: https://sebastin.pastebin.mozilla.org/9024568

2) https://sebastin.pastebin.mozilla.org/9024584
Again, I needed to change to Google's DNS and reinstall the CA certs.

3) https://sebastin.pastebin.mozilla.org/9024582

These errors were similar to what I faced last time.

Correction to the solutions given in c#1: SSH in, change to Google's DNS (as given in c#1), run the failing command manually, then reprovision/restart vagrant.
Ah the full log makes things a bit clearer. Notably:
* The DNS resolution failure is actually occurring during Vagrant's own setup, even before we get to run the shell provisioner. (So there's not much we can do wrt adjusting DNS during provision, since that's already too late.)
* Not having guest additions set up properly will no doubt break other things.
* All of the other failures in comment 0 are likely all due to the same DNS resolution issue.

For comparison, here's the output I get when running `vagrant destroy -f && vagrant up`:
https://emorley.pastebin.mozilla.org/9024596

To save having to continually destroy your treeherder VM, and so we can come up with a reduced testcase for opening a ticket against Vagrant, could you create a new directory and save the following Vagrantfile into it, and see if the DNS error reproduces during `vagrant up`?

```
Vagrant.configure("2") do |config|
  config.vm.box = "bento/ubuntu-16.04"
  config.vm.box_version = "= 2.3.5"
end
```

If that fails too, can you repeat it (after a vagrant destroy in the same directory) with debugging enabled and save the full log?
https://www.vagrantup.com/docs/other/debugging.html

Thanks!
Flags: needinfo?(sebastinssanty)
Oh and:
* What version of OS X are you using?
* What does `nslookup us.archive.ubuntu.com` from the same terminal that you are running vagrant, say?
The Virtualbox log would also be useful (eg $HOME/VirtualBox VMs/{machinename}/Logs/VBox.log), see:
https://www.virtualbox.org/manual/ch12.html#collect-debug-info
Attached file VBox.log
The VBox log. There were other logs too (namely, VBox.log.1, VBox.log.2, VBox.log.3)
Flags: needinfo?(sebastinssanty)
Attachment #8877929 - Attachment mime type: text/x-log → text/plain
(In reply to Ed Morley [:emorley] from comment #10)
> Oh and:
> * What version of OS X are you using?
Sierra Version 10.12.3

> * What does `nslookup us.archive.ubuntu.com` from the same terminal that you
> are running vagrant, say?

    treeherder|master ⇒ nslookup us.archive.ubuntu.com
    ;; connection timed out; no servers could be reached

This shows up, even though I have 8.8.8.8 as one of my DNS servers.
Most probably, as you said, it is a DNS resolution problem. In that case I don't think there is much that can be fixed from the treeherder side. Maybe the docs could just point to this DNS workaround as a precaution for people who may face this problem.
(In reply to Sebastin Santy [:seban] from comment #13)
> treeherder|master ⇒ nslookup us.archive.ubuntu.com

Sorry by "the same terminal that you are running vagrant" I mean the host :-)
(In reply to Ed Morley [:emorley] from comment #14)
> > treeherder|master ⇒ nslookup us.archive.ubuntu.com
> 
> Sorry by "the same terminal that you are running vagrant" I mean the host :-)

Yes, that was on host.

Thanks for the debugging, I confirmed it was a DNS issue. I kept 8.8.8.8 as the highest preference DNS, and did a `vagrant destroy -f && vagrant up` and got it working. This means a note could be added to the docs, warning about DNS problems.

Here is the output log: https://sebastin.pastebin.mozilla.org/9024720, similar to yours :-)

Thanks.
(In reply to Sebastin Santy [:seban] from comment #15)
> Thanks for the debugging, I confirmed it was a DNS issue. I kept 8.8.8.8 as
> the highest preference DNS, and did a `vagrant destroy -f && vagrant up` and
> got it working.

Ah great to hear! We probably should have begun with the nslookup from the host hehe.

As discussed on IRC (including here for posterity), looking at the attached VBox.log, there's:

    00:00:00.133951 NAT: resolv.conf: nameserver 192.30.252.128
    00:00:00.133958 NAT: resolv.conf: nameserver 10.1.1.61
    00:00:00.133961 NAT: resolv.conf: nameserver 10.1.1.62
    00:00:00.133964 NAT: resolv.conf: too many nameserver lines, ignoring 10.1.1.63
    00:00:00.133967 NAT: resolv.conf: too many nameserver lines, ignoring 8.8.8.8
    00:00:00.133979 NAT: Adding domain name domain.name
    00:00:00.133981 NAT: DNS#0: 192.30.252.128
    00:00:00.133984 NAT: DNS#1: 10.1.1.61
    00:00:00.133986 NAT: DNS#2: 10.1.1.62

Over IRC you provided the contents of your host's /etc/resolv.conf prior to the reordering:

    domain domain.name
    nameserver 192.30.252.128
    nameserver 10.1.1.61
    nameserver 10.1.1.62
    nameserver 10.1.1.63
    nameserver 8.8.8.8 

Plus also the output from `scutil --dns`:
    https://sebastin.pastebin.mozilla.org/9024733

Summary:
* For whatever reason, DNS resolution of 'us.archive.ubuntu.com' during Vagrant's `apt-get update` isn't working for you (log: [1]) using the DNS server 192.30.252.128 (your router) or your university DNS (10.1.1.*), whereas it does work via Google DNS (8.8.8.8).
* Your original OS X DNS configuration *did* include Google DNS in the list, but as the 5th entry (the OS X 'System Preferences -> Network' UI doesn't appear to limit the number of DNS servers).
* OS X apps that use the native DNS API appear to be fine with having 5 DNS servers listed, and presumably eventually fall back to Google DNS when resolution fails via your router/university (albeit adding a delay).
* OS X also syncs the native DNS config to /etc/resolv.conf, for use by apps that aren't aware of the native DNS API (for example nslookup).
* However resolv.conf is expected to contain no more than 3 `nameserver` entries (set by `MAXNS` [2]), which is why `nslookup us.archive.ubuntu.com` on the host fails.
* Virtualbox has three different DNS modes when using the NAT networking mode (see [3]):
  1) Default: The DNS configuration on the host is passed to the guest as a list of IP addresses, which are used directly without going via the host.
  2) DNS proxy mode: Enabled via `--natdnsproxy1=on`, and means Virtualbox intercepts DNS requests from the guest and looks them up from the DNS servers itself.
  3) "Use host DNS resolver" mode: Enabled via `--natdnshostresolver1=on` (which overrides `--natdnsproxy1`), and makes Virtualbox intercept DNS requests like mode 2, except that Virtualbox's DNS proxy passes them to the host OS's resolver instead of communicating with the DNS servers itself.
* However Vagrant overrides Virtualbox's default and uses mode 2 most of the time (see [4]), unless /etc/resolv.conf is using 127.0.0.1 as the nameserver, when instead it uses mode 1 (see [5]).
* So in your case mode 2 is being used - which doesn't work, since Virtualbox's resolv.conf parser discards all but the first 3 DNS server entries (source: [6]).
* It's not clear whether mode 1 would work or not, but I suspect it wouldn't either. (You can try it by setting `vb.auto_nat_dns_proxy` to false in the Vagrantfile.)
* Your PR (which enables `--natdnshostresolver1`) is making Virtualbox use mode 3, which works presumably because the native OS X DNS resolver uses all DNS servers and not just the first three.
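
To make the 3-entry limit concrete, here is a minimal Ruby sketch that applies a `MAXNS`-style cut-off to the host resolv.conf quoted above. The helper name `split_nameservers` is made up for illustration, and this is not Virtualbox's actual parser (that is the C code linked at [6]); it only mimics the "keep the first three, ignore the rest" behaviour seen in VBox.log.

```ruby
# Sketch only: mimics the "first three nameservers win" behaviour.
MAXNS = 3 # limit documented in resolv.conf(5)

# Returns [used, ignored]: nameserver addresses in file order,
# split at the given limit.
def split_nameservers(resolv_conf_text, limit = MAXNS)
  servers = resolv_conf_text.each_line.map do |line|
    match = line.match(/\A\s*nameserver\s+(\S+)/)
    match && match[1]
  end.compact
  [servers.first(limit), servers.drop(limit)]
end

# Host /etc/resolv.conf contents as quoted above (before the reordering).
HOST_RESOLV_CONF = <<~CONF
  domain domain.name
  nameserver 192.30.252.128
  nameserver 10.1.1.61
  nameserver 10.1.1.62
  nameserver 10.1.1.63
  nameserver 8.8.8.8
CONF

used, ignored = split_nameservers(HOST_RESOLV_CONF)
puts "used:    #{used.join(', ')}"    # used:    192.30.252.128, 10.1.1.61, 10.1.1.62
puts "ignored: #{ignored.join(', ')}" # ignored: 10.1.1.63, 8.8.8.8
```

The split matches the "too many nameserver lines, ignoring ..." entries in VBox.log: 8.8.8.8 is the only server that resolves the Ubuntu mirror, and it is the one that never gets used.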

Ways to prevent this:
* Debug your router/University's DNS to figure out why one or both of them aren't resolving 'us.archive.ubuntu.com' (check using `nslookup us.archive.ubuntu.com 192.30.252.128` and `dig +trace us.archive.ubuntu.com @192.30.252.128` etc).
* Edit your OS X DNS config so that you don't exceed the 3 DNS server limit (or at least put the important ones in the first three). This will also improve DNS resolution speed. I'd suggest only including one DNS server from your university rather than all three.
* Edit the Vagrantfile (like you have in your PR) to enable host DNS resolution mode (mode 3) for everyone. This will presumably slow down DNS requests, since they are then routed via an additional layer.
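
For reference, a Vagrantfile change along the lines of the last option might look roughly like the sketch below. This is an assumption about the general shape, not necessarily the exact change in the PR: it uses the Virtualbox provider's `customize` hook to pass `--natdnshostresolver1 on` to `VBoxManage modifyvm`, and the box name is simply the one already mentioned in this bug.

```ruby
# Vagrantfile sketch; only the provider block is relevant to this bug.
Vagrant.configure("2") do |config|
  config.vm.box = "bento/ubuntu-16.04"

  config.vm.provider "virtualbox" do |vb|
    # Mode 3: have Virtualbox pass guest DNS queries to the host OS resolver.
    vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]

    # Alternative experiment (mode 1): stop Vagrant from enabling the DNS proxy.
    # vb.auto_nat_dns_proxy = false
  end
end
```

A setting like this only takes effect when the VM is next booted by Vagrant, so a `vagrant reload` (or destroy/up) would be needed after editing.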

Since the root cause here is "broken DNS servers" / "suboptimal DNS server settings on host", I'm hesitant to accept the PR straight away given there's an easy fix (reordering/reducing the size of the DNS server list), which will also improve your host OS's overall DNS performance too.

Upstream issues that could be filed:
* Apple:
  - Add a warning to the OS X DNS settings UI when adding over 3 entries, or else prevent adding that many in the first place (unlikely they'll agree haha).
  - Get them to add an additional comment to /etc/resolv.conf, saying that most tools that use it will only use the first three nameserver entries within.
* Virtualbox:
  - Document that only the first 3 DNS servers from the OS X config will be used, under "Known Issues" [7].
  - Make natdnshostresolver be enabled by default either just for this specific scenario (>3 DNS servers) or else all the time on OS X.
* Vagrant:
  - Improve the "DNS Not Working" section under common issues, since it mentions the proxy mode rather than the host resolver mode [8]
  - Expand the automatic DNS mode selection logic to force mode 3 if more than 3 DNS nameservers found in resolv.conf.
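
As a rough illustration of that last Vagrant suggestion, the selection logic could be sketched in Ruby as below. This is hypothetical code, not Vagrant's actual implementation (see [4] and [5] for that): the method name, the symbols and the precedence between the two checks are all assumptions made for the example.

```ruby
# Hypothetical sketch of the proposed selection logic; not Vagrant's real code.
MAX_USABLE_NAMESERVERS = 3 # Virtualbox's parser keeps only the first three

# Returns :default (mode 1), :dns_proxy (mode 2) or :host_resolver (mode 3)
# for a list of nameserver IPs read from the host's /etc/resolv.conf.
def choose_nat_dns_mode(nameservers)
  # Existing behaviour described in the summary: a 127.0.0.1 nameserver
  # means the DNS proxy is left off (mode 1).
  return :default if nameservers.include?("127.0.0.1")

  # Proposed addition: with more than three entries, some servers would be
  # silently dropped, so defer to the host OS resolver instead (mode 3).
  return :host_resolver if nameservers.length > MAX_USABLE_NAMESERVERS

  # Otherwise keep what Vagrant does most of the time today (mode 2).
  :dns_proxy
end

puts choose_nat_dns_mode(%w[192.30.252.128 10.1.1.61 10.1.1.62 10.1.1.63 8.8.8.8])
```

With the five-entry list from this bug the sketch picks `host_resolver`, which is the mode the PR enables by hand.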

Next steps will be filing an issue against Vagrant, which I'm happy to do at the start of next week, unless you'd rather do so yourself.

This has certainly been an interesting bug deep dive! :-)


[1] https://sebastin.pastebin.mozilla.org/9024565
[2] http://manpages.ubuntu.com/manpages/xenial/man5/resolv.conf.5.html
[3] https://www.virtualbox.org/manual/ch09.html#nat-adv-dns
[4] https://github.com/mitchellh/vagrant/blob/v1.9.5/plugins/providers/virtualbox/action/sane_defaults.rb#L26-L27
[5] https://github.com/mitchellh/vagrant/blob/v1.9.5/plugins/providers/virtualbox/action/sane_defaults.rb#L72-L80
[6] https://github.com/mdaniel/virtualbox-org-svn-vbox-trunk/blob/60ef7002b78bcbc0a4affedf4728f07ffc52f073/src/VBox/Devices/Network/slirp/resolv_conf_parser.c#L192
[7] https://www.virtualbox.org/manual/ch14.html#KnownProblems
[8] https://www.vagrantup.com/docs/virtualbox/common-issues.html#dns-not-working
I haven't had a chance to write up the Vagrant ticket yet, however I've figured out why the Bento box is using us.archive.ubuntu.com rather than archive.ubuntu.com, and opened a PR to resolve it (if only since it will be slower for anyone outside the US):
https://github.com/chef/bento/pull/838

Could you check if `nslookup archive.ubuntu.com` succeeds from OS X with the original nameserver order? (I'm wondering if only the US one is blocked?)
Attachment #8878103 - Attachment is obsolete: true
Assigning to me to remind me to file the Vagrant ticket.
Assignee: nobody → emorley
Priority: -- → P2
Summary: Vagrant: Various errors encountered during installation that could be included in the doc → Vagrant up/provision fails with apt-get DNS errors ("Temporary failure resolving 'us.archive.ubuntu.com'")
My Bento box PR to switch the Ubuntu images from us.archive.ubuntu.com to archive.ubuntu.com has now been merged and will be in the next box release (these happen every few months).
Really sorry for not responding sooner; I was a bit bogged down with some GSoC work. The analysis is very comprehensive and pretty much describes everything. I am not on the same DNS now, so both us.archive.ubuntu.com and archive.ubuntu.com work for me. I guess removing `us` from the URL was the right fix. Again, thanks a lot for taking up this issue actively :-).
(In reply to Ed Morley [:emorley] from comment #20)
> My Bento box PR to switch the Ubuntu images from us.archive.ubuntu.com to
> archive.ubuntu.com has now been merged and will be in the next box release
> (these happen every few months).

The next release (2.3.8) had a regression from someone else's PR, which meant my fix wasn't used. I fixed that in:
https://github.com/chef/bento/pull/865

Once there's a new release after 2.3.8 hopefully all will work fine.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Treeherder: Docs & Development → TreeHerder