(tracker) move services on dm-wwwbuild01 to *.{pub,pvt}.build.mozilla.org virtualhosts

RESOLVED FIXED

Status

P2
major
RESOLVED FIXED
8 years ago
5 years ago

People

(Reporter: aki, Assigned: dustin)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [vm][networking])

(Reporter)

Description

8 years ago
The fact that build.mozilla.org is both a machine hostname and our primary subdomain is a constant source of communication confusion.

I believe it's currently named dm-wwwbuild01 internally.

To change this, I think we need to change tbpl, Try, cruncher, and a number of developer-facing URLs (pending.html etc.).

This won't be low-touch and will require coordination with other teams, but will hopefully remove a self-inflicted cause of confusion.
(Reporter)

Comment 1

8 years ago
We could also accomplish this by setting up a second box, making it duplicate in functionality (or better!), pointing everything to that box, and then shutting off and repurposing the current build.mozilla.org box.
(Reporter)

Comment 2

8 years ago
I really don't know what status whiteboard entry to use here.
(In reply to comment #1)
> We could also accomplish this by setting up a second box, making it duplicate
> in functionality (or better!), pointing everything to that box, and then
> shutting off and repurposing the current build.mozilla.org box.

Could we CNAME build.mozilla.org to something instead of starting over?
(Reporter)

Comment 4

8 years ago
The whole point is to get rid of the name build.mozilla.org in DNS entirely, so we know that whenever the words "build dot mozilla dot org" pass our lips, we're referring to the subdomain.
We could have it live in the interim and switch tools to the new CNAME.
Whiteboard: [vm][networking]
Blocks: 618109
No longer blocks: 618109
So there's been some discussion of names in IRC, but one thing we didn't cover is the DNS questions.  Copying zandr to chase these down:

1. Can the DNS handle a public hostname in the *.build.mozilla.org subdomain, e.g., api.build.mozilla.org or clobberer.build.mozilla.org?

2. How does IT feel, in general, about using CNAME and vhosts to dissociate services from particular machines?

In the interests of someone taking a stand, and assuming the answers to the above are roughly "no" and "good", I'm proposing:

 releng-ws.mozilla.org - menu, contact info, links
 clobberer.releng-ws.mozilla.org - clobberer
 status.releng-ws.mozilla.org - status dashboard
 api.releng-ws.mozilla.org - buildapi/self-serve
 
where all of these are CNAMEs for the same machine, and that machine is publicly available?

The advantage to doing this rename at the same time as the beefing-up (bug 617129) is that we can do this slowly, even hosting both the new and old names on the existing VM for a while.  And we get to keep the VM and hostname around until bug 618109 (regarding internal services) is also finished.

There are infrasec concerns here, too.  In general, I think we could use some guidance from zandr when mental cycles are available.
Amy, can you answer some of the questions in the previous comment?
ravi, zandr (oops, forgot to copy you earlier)

1. Can the DNS handle a public hostname in the *.build.mozilla.org subdomain,
e.g., api.build.mozilla.org or clobberer.build.mozilla.org?

Comment 9

8 years ago
Yes, we can accommodate a public and a private IP existing in build.mozilla.org.  What I would very much like to avoid, however, is having split horizon where $HOSTNAME.build.mozilla.org has both a public and a private IP.  I'm not sure if that is possible with the RelEng infrastructure and will defer to your teams to try to avoid creating this condition.

Also re: Comment 3 for those interested in knowing:

build.mozilla.org is a zone apex with an SOA record, which is mutually exclusive with being a CNAME.
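
To make the split-horizon condition described above concrete, here is a minimal Python sketch that flags names resolving to a private address in one DNS view and a public address in the other. The function name `split_horizon_names`, the two dictionaries standing in for the internal and external views, and the sample hostnames are all assumptions for illustration; this is not the actual zone data or any tool IT uses.

```python
import ipaddress


def split_horizon_names(internal_view, external_view):
    """Return names that are private inside but public outside.

    Both arguments are hypothetical {hostname: ip_string} mappings that
    stand in for the answers each DNS view would give.
    """
    flagged = []
    for name in sorted(set(internal_view) & set(external_view)):
        inside = ipaddress.ip_address(internal_view[name])
        outside = ipaddress.ip_address(external_view[name])
        # Split horizon: same name, private address inside, public outside.
        if inside.is_private and not outside.is_private:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Illustrative data only; hostnames are made up for the example.
    internal = {
        "somehost.build.mozilla.org": "10.0.0.5",
        "otherhost.build.mozilla.org": "63.245.208.186",
    }
    external = {
        "somehost.build.mozilla.org": "63.245.208.186",
        "otherhost.build.mozilla.org": "63.245.208.186",
    }
    print(split_horizon_names(internal, external))
```

A name that resolves to the same public address in both views is not flagged, which is the outcome the comment asks for.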
Blocks: 655832
Blocks: 655794
OK, this is now creating blocking bugs, so we should get moving.  Based on what ravi suggests, I'd like to create CNAMEs

 pub.build.mozilla.org (menu)
 clobberer.pub.build.mozilla.org
 buildapi.pub.build.mozilla.org
 trychooser.pub.build.mozilla.org

all pointing to the external IP for dm-wwwbuild01.  These should resolve both inside and outside the build VPN, and as ravi asked in comment 9 should resolve to the same address.  I hope I've understood correctly.

I'm happy to add other names to this list, as we discover more apps that we're hosting on this system.

Aki, does this sound good?  Ravi?
Severity: normal → major
Priority: P3 → P2
Amy points out that dm-wwwbuild01 will resolve in mtv1 to an address that non-build people cannot access, so simply making CNAMEs to dm-wwwbuild01 won't work.  So a more correct and detailed suggestion:

 dm-wwwbuild01-ext.mozilla.org IN A 63.245.208.186
which is the same as dm-wwwbuild01's external address

 clobberer.pub.build.mozilla.org IN CNAME dm-wwwbuild01-ext.mozilla.org.
set to resolve identically both inside and outside the build network.  Same for all of the other aliases (buildapi, trychooser, etc.)

We can come up with another solution for internal clobberer requests (from slaves).

Once we're agreed on this, I'll break it down into bugs for each service, as they will each have their hairy bits.
As I mentioned to Dustin, I think what should probably happen is that we create an A record that points to 63.245.208.186 that does not have an internal 10.x.x.x record (so there are no issues with separate dns views).  We can then CNAME each of the services to that external mozilla.org A record (so we only have to change one record should it need to change in the future).
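
The benefit of pointing every service CNAME at one external A record can be sketched in a few lines of Python. This is only a toy lookup over an in-memory table, not real DNS: the `resolve` helper and the `RECORDS` dictionary are hypothetical, and the replacement address used in the usage note is a placeholder.

```python
def resolve(name, records, max_hops=8):
    """Follow CNAMEs until an A record is found; return the address or None."""
    for _ in range(max_hops):
        rtype, value = records.get(name, (None, None))
        if rtype == "A":
            return value
        if rtype != "CNAME":
            return None  # unknown name or unsupported record type
        name = value  # follow the alias
    return None  # too many hops; treat as a loop


# Toy table mirroring the proposal: one A record, one CNAME per service.
RECORDS = {
    "dm-wwwbuild01-ext.mozilla.org": ("A", "63.245.208.186"),
    "clobberer.pub.build.mozilla.org": ("CNAME", "dm-wwwbuild01-ext.mozilla.org"),
    "buildapi.pub.build.mozilla.org": ("CNAME", "dm-wwwbuild01-ext.mozilla.org"),
    "trychooser.pub.build.mozilla.org": ("CNAME", "dm-wwwbuild01-ext.mozilla.org"),
}

if __name__ == "__main__":
    for service in ("clobberer", "buildapi", "trychooser"):
        print(service, resolve(service + ".pub.build.mozilla.org", RECORDS))
```

Replacing only the `dm-wwwbuild01-ext.mozilla.org` entry (for example with a placeholder such as `("A", "192.0.2.10")`) changes what all three service names resolve to, which is the "only one record to change" property described above.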

Some things that might be impacted are SSL certificates or things that do forward and reverse dns matching.  Do we use anything sensitive to that on bmo?
I don't believe so, but we can work that out in the per-service bugs.  Using a pseudo-subdomain like pub.build.mozilla.org will let us use wildcard SSL certs if necessary.

Comment 14

8 years ago
But build.mozilla.org is already a sub domain...
Ravi, Amy, and I should talk briefly about this via IRC tomorrow.  Ravi, do you want to grab us when you have a chance?
(Reporter)

Comment 16

8 years ago
(In reply to comment #10)
> Aki, does this sound good?

It a) stops using a CNAME that conflicts with our SOA/subdomain, and b) has a CNAME per service so we can move them easily.  And I don't object to any other specifics, so thumbs up from me.

I do think there's a build.m.o SSL cert on dm-wwwbuild01, but I don't know specifics.
(Reporter)

Comment 17

8 years ago
[18:19]	<aki>	catlee-away: will an external address for clobberer break buildslaves' access ?
[18:19]	<aki>	or will they use the internal ip
[18:19]	<catlee-away>	that depends
[18:19]	<catlee-away>	all they need is to be able to hit it without ldap auth
[18:19]	<catlee-away>	but if we lock down the build network, then they'll need an internal ip
[18:20]	<catlee-away>	or an exception in the firewall
[18:21]	<aki>	hm, should we put that in the bug?
[18:21]	<bear-afk>	so many things may be hitting *.build.mozilla.org from inside that a single ip exception would cover it
[18:21]	<catlee-away>	I don't care if we hit an external ip or not
[18:21]	<catlee-away>	as long as it works
[18:21]	<bear-afk>	still something that should be spelled out in the bug
[18:22]	<catlee-away>	multi-homed hosts always seem to cause problems
[18:22]	<bear-afk>	this isn't necessarily multihomed

I *think* clobberer is the only one of the above that the slaves use, but I might be mistaken.
We also pull a bunch of stuff from build.m.o for talos. Some of these files are private.
Right - I'll dissect the services on sub-bugs.  All of them will need some care.  Internal services can hit things via a different hostname, so that's not a problem.

I'm not sure what ravi meant in comment 14 - build.mozilla.org is a subdomain, but as I understand it we can still use an SSL cert with *.pub.build.mozilla.org that will cover any SSL'd services we put there.  Ravi, does that about cover it?  Any other objections or tweaks to the plan?

If not, I'll file some dependent bugs for the services I know of and get this ball rolling.
I believe Ravi meant that you can't use (for example) a *.mozilla.org wildcard SSL certificate for something under *.build.mozilla.org or *.pub.build.mozilla.org, as RFC 2818 dictates that wildcard certificates only go to the first sub-component ("E.g., *.a.com matches foo.a.com but not bar.foo.a.com. f*.com matches foo.com but not bar.com." --RFC2818).
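
The wildcard rule quoted from RFC 2818 can be illustrated with a short Python sketch. It is a simplified model that honors a wildcard only in the left-most label and compares the remaining labels exactly; it is not what any TLS library actually runs, and the function name `wildcard_matches` is made up for the example.

```python
import re


def wildcard_matches(pattern, hostname):
    """Return True if a certificate name pattern covers hostname.

    Simplified model: '*' may appear only in the left-most label and
    never matches across a dot.
    """
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False  # a wildcard never spans more than one label
    if p_labels[1:] != h_labels[1:]:
        return False  # everything after the first label must be identical
    # Turn 'f*' into the regex 'f.*'; labels contain no dots, so '.*' is safe.
    label_re = ".*".join(re.escape(part) for part in p_labels[0].split("*"))
    return re.fullmatch(label_re, h_labels[0]) is not None


if __name__ == "__main__":
    print(wildcard_matches("*.mozilla.org", "clobberer.pub.build.mozilla.org"))
    print(wildcard_matches("*.pub.build.mozilla.org", "clobberer.pub.build.mozilla.org"))
```

Under this model a `*.mozilla.org` certificate does not cover `clobberer.pub.build.mozilla.org`, while a `*.pub.build.mozilla.org` certificate does, which matches the plan to use a wildcard cert specific to the pub pseudo-subdomain.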
Depends on: 657024
Depends on: 657025
Depends on: 657026
Depends on: 657046
Depends on: 657359
Depends on: 657361
Depends on: 657362
No longer blocks: 655832
Depends on: 655832
Depends on: 657784
Depends on: 658024
I'll take this for a bit to sort out the IT vs. releng parts of it, and make sure the IT side gets done.
Assignee: nobody → dustin
So I need some downtime in which I can puppetize dm-wwwbuild01 and make sure that it continues serving all of the fun stuff it serves now.  Once that's done, it's relatively straightforward to move one service at a time to a vhost, using puppet.

This can easily ride along with any other downtimes, and does not deserve its own.
Flags: needs-treeclosure?
Depends on: 674665
I know we've had workweek and whatnot, but I'd like to get this moving again - can we schedule a downtime for this next week?
(In reply to Dustin J. Mitchell [:dustin] from comment #23)
> I know we've had workweek and whatnot, but I'd like to get this moving again
> - can we schedule a downtime for this next week?

Yeah, I'm looking to. How is 9am EDT on Wednesday for you?
Sounds good to me.
Per comment 22, this downtime is for puppetizing the host - bug 674665, which is infra-only because it involves infra puppet configs.  This is using existing infra classes, so it makes some non-obvious changes to the Apache configuration.  I *think* that the resulting system will work the same way the existing system does, but I can't be sure.  Hence the downtime.

Once this is in place, we'll need to watch for build failures related to changes in services provided by this system.  I know about clobberer and the talos downloads, but there may be subtleties in how those are implemented, or other unknown services.

As for rollback, I'll back up /etc/httpd before puppetizing, and that will provide a potential rollback strategy.
Summary: rename build.mozilla.org (the machine, not the subdomain) → (tracker) move services on dm-wwwbuild01 to *.{pub,pvt}.build.mozilla.org virtualhosts
Flags: needs-treeclosure? → needs-treeclosure+
This puppet change was landed successfully.
Flags: needs-treeclosure+
So everything from here on out can be done without significant risk, as follows:

1. Set up a new vhost, copying the config out of that for build.mozilla.org
2. Test out that vhost by using staging slaves or pointing a select few devs at it or whatever
3. Point everything to that vhost
4. Verify and wait until nothing's looking at build.mozilla.org anymore
5. Remove the copied config from build.mozilla.org

I'll take one of the dependent bugs, as a model, and then hopefully releng can handle scheduling and testing the rest -- I'll do the Puppet/Apache changes.
Assignee: dustin → nobody

Updated

7 years ago
Duplicate of this bug: 614629
Assignee: nobody → dustin
This work will continue in parallel with work in bug 774354 to move these new services to a web cluster.
Blocks: 774354
Depends on: 702337
http{,s}://build.mozilla.org is still hosted on relengweb1.dmz.scl3, and will not be supported on the new releng cluster, in hopes we can finally close this almost-two-year-old bug :)

My rough plan is as follows:
 - serve everything from virtualhosts on the new releng cluster (bug 774354)
 - make sure bugs are on file to start using those virtualhosts (blocking this bug)
 - add 301 redirects to build.mozilla.org paths where/when it won't cause failures
 - keep build.mozilla.org hosted on relengweb1 as-is until everyone's satisfied
    (cert expires 3/2/14, so let's call that the deadline)
 - kill relengweb1 when build.mozilla.org is no longer used.
I'll put this bug back in the releng queue after the first step, as the blocking bugs are out of scope for me, but I'm still happy to help where I can.  I'll do the 301's and VM-killing when the time comes.

I surveyed the logs for http{,s}://build.mozilla.org for yesterday (August 6), to make sure there's nothing we're missing still on the host.  Here's what I found:

http:
/builds - see below
/tryserver-symbols - bug 702337
/clobberer - bug 657024
/talos - bug 657046
/trychooser - already 302'ing

https:
/buildapi - all POSTs; see below
/clobberer - bug 657024
/trychooser - already 302'ing
/tryserver-builds - only bots, unused per 729667 comment 10
/update-bump-unit-tests - bug 657361

For http://build.mozilla.org/builds, bug 657359 comment 5 explains most of the content (except buildfaster.csv.gz).  All of this content is rsync'd from cruncher regularly, so it's easy to mirror it elsewhere while still serving it at its existing URL.  I'll take care of that in bug 657359, then file a bug blocking this one to change incoming links.

For https://build.mozilla.org/buildapi, all of the incoming requests (and there are lots) are from autoland.  So we can fix that up, then I can add a 301.  I'll file a bug blocking this one to make the autoland fix.

As for the 301's:
  already 302'ing, change to 301:
http://build.mozilla.org/trychooser -> http://trychooser.pub.build.mozilla.org
https://build.mozilla.org/trychooser -> http://trychooser.pub.build.mozilla.org

  pending bug 657359 + link-fixing bug:
http://build.mozilla.org/builds -> http://builddata.pub.build.mozilla.org/reports

  pending autoland fix:
https://build.mozilla.org/buildapi -> https://secure.pub.build.mozilla.org/buildapi

  pending bug 657024:
https://build.mozilla.org/clobberer -> https://secure.pub.build.mozilla.org/clobberer
https://build.mozilla.org/clobberer-stage -> https://secure.pub.build.mozilla.org/clobberer-stage
http://build.mozilla.org/clobberer -> https://secure.pub.build.mozilla.org/clobberer
http://build.mozilla.org/clobberer-stage -> https://secure.pub.build.mozilla.org/clobberer-stage

  pending bug 702337:
http://build.mozilla.org/tryserver-symbols -> ??
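
For reference, the redirect list above can be written down as a lookup table. The following Python sketch is only a model of the plan, not the Puppet/Apache configuration: the `REDIRECTS` table and `redirect_target` helper are hypothetical, and carrying the rest of the path over to the new URL is an assumption the list does not state. The tryserver-symbols target is left undecided, as it is in the list.

```python
from urllib.parse import urlsplit

# (scheme, path prefix on build.mozilla.org) -> new base URL.
REDIRECTS = {
    ("http", "/trychooser"): "http://trychooser.pub.build.mozilla.org",
    ("https", "/trychooser"): "http://trychooser.pub.build.mozilla.org",
    ("http", "/builds"): "http://builddata.pub.build.mozilla.org/reports",
    ("https", "/buildapi"): "https://secure.pub.build.mozilla.org/buildapi",
    ("https", "/clobberer"): "https://secure.pub.build.mozilla.org/clobberer",
    ("https", "/clobberer-stage"): "https://secure.pub.build.mozilla.org/clobberer-stage",
    ("http", "/clobberer"): "https://secure.pub.build.mozilla.org/clobberer",
    ("http", "/clobberer-stage"): "https://secure.pub.build.mozilla.org/clobberer-stage",
    ("http", "/tryserver-symbols"): None,  # target still undecided ("??")
}


def redirect_target(url):
    """Return the planned new URL for an old build.mozilla.org URL, or None."""
    parts = urlsplit(url)
    if parts.hostname != "build.mozilla.org":
        return None
    best = None
    for (scheme, prefix), target in REDIRECTS.items():
        if scheme != parts.scheme:
            continue
        # Match the prefix itself or anything below it, not '/clobberer-stage'
        # against '/clobberer'.
        if parts.path == prefix or parts.path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, target)
    if best is None or best[1] is None:
        return None
    prefix, target = best
    # Assumption: the remainder of the path is appended to the new base.
    return target + parts.path[len(prefix):]


if __name__ == "__main__":
    print(redirect_target("http://build.mozilla.org/clobberer-stage"))
```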
Depends on: 780899
No longer blocks: 774354
301's for trychooser converted from 302's.
301's listed as "pending bug 657024" above are added in puppet.
301's listed as "pending autoland fix" above are added to puppet.

That leaves:

  pending bug 657359 + link-fixing bug:
http://build.mozilla.org/builds -> http://builddata.pub.build.mozilla.org/reports

  pending bug 702337:
http://build.mozilla.org/tryserver-symbols -> ??
The remaining work on this bug is in the dependent bugs, and all are releng tasks.  I'm cc'd, so I'll take care of the 301's as necessary.
Assignee: dustin → nobody
Blocks: 787394
(Reporter)

Comment 34

6 years ago
I think relengweb1 has become our tooltool server as well.
Correct:

dustin@Lorentz ~ $ host tooltool.pub.build.mozilla.org
tooltool.pub.build.mozilla.org is an alias for relengweb-zlb.vips.scl3.mozilla.com.
relengweb-zlb.vips.scl3.mozilla.com has address 63.245.215.17
(In reply to Aki Sasaki [:aki] from comment #34)
> I think relengweb1 has become our tooltool server as well.

And to be clear, this is not hosted on http://build.mozilla.org, so not related to this bug.

The action remaining on this bug is in the dependencies.
I just did one more sweep of the access logs for build.mozilla.org, both http/https.  Other than a bunch of bots and pen-testers, everything of interest is either /talos (bug 657046) or /builds (bug 657359).  So, we're close on this!

Once those are finished, this host will only serve 301's.  At that point, I'll move it to the new cluster.
Depends on: 883174
All that remains is bug 657046 - ironically probably the most production-sensitive of the services on this VM!
On Sept 30, we'll take down http://build.mozilla.org and https://build.mozilla.org permanently.
I see an authenticated user downloading /talos/zips/tp5n.zip from outside the build network.  If that should be allowed, let me know.  My impression is that that's not expected and should be disallowed in the new implementation.
Flags: needinfo?(coop)
(In reply to Dustin J. Mitchell [:dustin] from comment #40)
> I see an authenticated user downloading /talos/zips/tp5n.zip from outside
> the build network.  If that should be allowed, let me know.  My impression
> is that that's not expected and should be disallowed in the new
> implementation.

I can't think of a reason why that access should exist. Let's nix it. We can find another way to get people those files should they really need them.
Flags: needinfo?(coop)
Thanks - that shouldn't be hard to add if necessary, at a sub-URI of https://secure.pub.b.m.o.
Ok, I have build.m.o implemented on the releng cluster.  It's actually three vhosts with the same name:
 1. http://build.mozilla.org where it resolves to an internal IP
 2. http://build.mozilla.org where it resolves to an external IP
 3. https://build.mozilla.org where it resolves to an external IP

1 and 2 are similar, except that 2 does not serve talos.  There's no internal counterpart to 3 - I don't see any such accesses in the logs.  1 is the only production-critical vhost.  I tested it by comparing the old and new with:

  curl -H "Host: build.mozilla.org" http://10.22.74.128/talos/findlinks/index.html
  curl -H "Host: build.mozilla.org" http://10.22.74.160/talos/findlinks/index.html

(that being the first file I found that wasn't huge and binary).
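
The same old-versus-new comparison could be scripted if more than one file needs checking. Below is a minimal Python sketch, assuming plain HTTP and that both servers are reachable from where it runs. The helper names `fetch` and `same_response` are made up for illustration, and the real check was done with the curl commands above.

```python
import urllib.request


def fetch(ip, path, host="build.mozilla.org", timeout=10):
    """GET path from a specific server IP while sending the vhost's Host header."""
    req = urllib.request.Request("http://" + ip + path, headers={"Host": host})
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()


def same_response(old_ip, new_ip, path, fetch=fetch):
    """Return True if both servers give the same status and body for path."""
    return fetch(old_ip, path) == fetch(new_ip, path)


if __name__ == "__main__":
    # Addresses taken from the curl commands above; requires network access
    # to the build network, so this will not work from elsewhere.
    print(same_response("10.22.74.128", "10.22.74.160", "/talos/findlinks/index.html"))
```

The `fetch` parameter is injectable so the comparison logic can be exercised without touching the network.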

I'll file a CAB bug to move this last vhost off of the old server, which involves both CNAME changes (internally) and taking over the existing IP for build.mozilla.org in zeus (externally).
Depends on: 886468
http{,s}://build.mozilla.org is now hosted on the releng cluster.  This bug will stay open until Sept 30, per comment 39.  Bug 657046 still blocks this change, but there are two whole months to close it!
Assignee: nobody → dustin
No longer blocks: 787394
Product: mozilla.org → Release Engineering
Dependencies are closed, so http{,s}://build.mozilla.org, both internally and externally, has 45 days to live.
For September to date:

Internal:

10.22.81.211 - - [02/Sep/2013:00:01:01 -0700] "GET /builds/last-job-per-slave.txt HTTP/1.0" 301 277 "-" "Wget/1.12 (linux-gnu)"
10.22.81.211 - - [02/Sep/2013:18:04:11 -0700] "GET /builds/pending/pending.html HTTP/1.1" 301 275 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) Gecko/20100101 Firefox/25.0"
10.22.81.211 - - [03/Sep/2013:00:01:01 -0700] "GET /builds/last-job-per-slave.txt HTTP/1.0" 301 277 "-" "Wget/1.12 (linux-gnu)"
10.22.81.211 - - [03/Sep/2013:06:33:56 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [03/Sep/2013:06:52:32 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [03/Sep/2013:07:49:34 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [03/Sep/2013:08:03:37 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [03/Sep/2013:11:49:50 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [03/Sep/2013:12:17:46 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [03/Sep/2013:12:57:32 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [03/Sep/2013:13:14:53 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [03/Sep/2013:14:36:00 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [03/Sep/2013:14:49:41 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [03/Sep/2013:15:11:20 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [04/Sep/2013:08:14:57 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [04/Sep/2013:08:25:00 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [04/Sep/2013:08:52:13 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [04/Sep/2013:09:04:16 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [04/Sep/2013:13:45:25 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [05/Sep/2013:07:31:48 -0700] "GET /talos/zips/talos.fcbb9d7d3c78.zip HTTP/1.1" 200 17031973 "-" "Python-urllib/2.6"
10.22.81.211 - - [05/Sep/2013:07:31:48 -0700] "GET /talos/zips/talos.fcbb9d7d3c78.zip HTTP/1.1" 200 17031973 "-" "Python-urllib/2.6"
10.22.81.211 - - [05/Sep/2013:07:54:39 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"
10.22.81.211 - - [05/Sep/2013:08:02:29 -0700] "GET /talos/zips/talos.fcbb9d7d3c78.zip HTTP/1.1" 200 17031973 "-" "Python-urllib/2.6"

External:
 - lots of bots -- you can tell which bots don't cache 301's
 - a few dozen hits to / from a copy of Firefox 4.0 on Windows.
 - a dozen or so /clobberer hits that look like they're from a browser (301's)
 - a few /builds/pending hits, similar (301's)
 - a few /tryserver-builds hits that seem to be pingbacks from blog software

So I think the turn-off on the 30th is going to be uneventful.
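
A survey like the one above can be reproduced with a small script. This Python sketch, assuming the access log is in the Apache combined format shown in the excerpt, counts requests per path and status and lists the paths still answered with something other than a 301. The names `LOG_RE`, `summarize`, and `not_yet_redirected` are hypothetical; this is not the tooling actually used for the survey.

```python
import re
from collections import Counter

# Apache combined log format, as in the excerpt above.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def summarize(lines):
    """Count requests by (path, status); lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m:
            counts[(m.group("path"), int(m.group("status")))] += 1
    return counts


def not_yet_redirected(counts):
    """Paths served with a status other than 301, i.e. still in real use."""
    return sorted({path for (path, status) in counts if status != 301})


if __name__ == "__main__":
    sample = [
        '10.22.81.211 - - [02/Sep/2013:00:01:01 -0700] "GET /builds/last-job-per-slave.txt HTTP/1.0" 301 277 "-" "Wget/1.12 (linux-gnu)"',
        '10.22.81.211 - - [03/Sep/2013:06:33:56 -0700] "GET /talos/zips/retry.zip HTTP/1.1" 200 3510 "-" "Python-urllib/2.7"',
    ]
    print(not_yet_redirected(summarize(sample)))
```

Paths that still return a 200 are the ones a shutdown would actually break, so they are worth chasing before the turn-off date.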
last reminder sent
httpd config removed
A records removed:

build.mozilla.org 	A 	10.22.74.160
build.mozilla.org 	A 	63.245.215.17

Kim, it looks like we're still seeing a few hits to retry.zip per hour.  On the fifth, we assumed these were from rebuilds on try, but it seems the URL is still present in
  http://hg.mozilla.org/build/mozharness/annotate/ee2caa1098b9/configs/android/android_panda_talos_releng.py#l15
which you're the last to touch.  Can you change that to the new URL, http://talos-bundles.pvt.build.mozilla.org/zips/retry.zip?

TODO: verify that VIP 63.245.215.17 is now unused and remove it from DNS and Zeus.
Flags: needinfo?(kmoir)

Updated

5 years ago
Depends on: 922042
Ed landed this fix in bug 922042
Flags: needinfo?(kmoir)
* TIG 63.245.215.17 removed from VS releng and releng-https
* TIG 63.245.215.17 deleted
* Forward, reverse DNS for 63.245.215.17 removed.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED

Updated

5 years ago
Depends on: 926868