Closed Bug 646046 Opened 13 years ago Closed 12 years ago

turn off external network access for staging and preproduction machines

Categories: Infrastructure & Operations :: RelOps: General (task)
Platform: x86_64 Linux
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED INCOMPLETE
People: Reporter: bhearsum; Assigned: mrz
Attachments: 1 file

Please turn off external network access for the following machines:
moz2-linux-slave03, 04, 10, 17, 51
mv-moz2-linux-ix-slave01
moz2-linux64-slave07, 10
moz2-darwin9-slave03, 08, 68, 010
moz2-darwin10-slave01, 02, 03, 04, 010
mw32-ix-slave01
win32-slave03, 04, 010, 21, 60
talos-r3-fed-001, 002, 010
talos-r3-fed64-001, 002, 010
talos-r3-leopard-001, 002, 010
talos-r3-snow-001, 002, 010
talos-r3-w7-001, 002, 003, 010
talos-r3-xp-001, 002, 003, 010
t-r3-w764-001, 002, 010


But please allow them to access all Mozilla mirror nodes.
(In reply to comment #0)

> But please allow them to access all Mozilla mirror nodes.

a) Where can I get that list?
b) Is that really a good idea?
   Help me understand this requirement. It seems to me that we're going to lock down these machines tightly except for letting them talk to machines all over the internet that are probably friendly. That doesn't quite scan.

As an aside, I don't think we've dealt with NTP, so I'll allow 123 out for now and file a separate bug to fix that and close that off.
(In reply to comment #1)
> (In reply to comment #0)
> 
> > But please allow them to access all Mozilla mirror nodes.
> 
> a) Where can I get that list?

https://nagios.mozilla.org/sentry/ may be complete, probably best to check with justdave though.

> b) Is that really a good idea?
>    Help me understand this requirement. It seems to me that we're going to lock
> down these machines tightly except for letting them talk to machines all over
> the internet that are probably friendly. That doesn't quite scan.

As part of every release we have a test that goes through all of the releasetest snippets and checks links, which eventually point to various mirror nodes. We do plan to make changes that don't require us to hit Mozilla mirrors, but in the meantime we need to make sure this test continues to work.
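
For reference, a rough sketch of what that check boils down to (the snippet path and filenames here are made up; the real check lives in the release automation): walk the releasetest snippets, pull out the download URLs, and verify each one answers, which is why these slaves currently need to reach whichever mirror those links resolve to.

# Purely illustrative; the snippet location is hypothetical, not the real automation.
for snippet in /builds/releasetest-snippets/*.txt; do
    # pull every http(s) URL out of the snippet
    grep -oE 'https?://[^ ]+' "$snippet" | while read -r url; do
        # --spider: check that the link answers without downloading it
        if ! wget -q --spider --timeout=30 "$url"; then
            echo "BROKEN: $url (from $snippet)"
        fi
    done
done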

> As an aside, I don't think we've dealt with NTP, so I'll allow 123 out for now
> and file a separate bug to fix that and close that off.

Don't the build machines sync to ntp.build.m.o?
(In reply to comment #2)
 
> https://nagios.mozilla.org/sentry/ may be complete, probably best to check with
> justdave though.

Netops can't be asked to maintain a whitelist that looks like that, so we're going to have to do it ourselves. This means if we want to turn off access before we fix the releasetest snippet checks, we need to build and install a proxy so we can maintain that whitelist.

> Don't the build machines sync to ntp.build.m.o?

Not really, no. Filed bug 646056 which blocks the tracking bug.
(In reply to comment #3)
> (In reply to comment #2)
> 
> > https://nagios.mozilla.org/sentry/ may be complete, probably best to check with
> > justdave though.
> 
> Netops can't be asked to maintain a whitelist that looks like that, so we're
> going to have to do it ourselves. This means if we want to turn off access
> before we fix the releasetest snippet checks, we need to build and install a
> proxy so we can maintain that whitelist.

We just chatted about this a bit on IRC. Given that we add a new mirror every week or so, this option still isn't very good: we'll likely end up behind the list and cause erroneous test failures. We're strongly leaning towards blocking this on changing the releasetest checks to not touch external mirrors.
Nthomas came up with the idea of using a special bouncer region/country to shunt all requests from build machines to internal mirrors only. If we do that, we only need to whitelist the external IPs of those mirrors, which rarely change. Filed bug 646076 on this.
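
Once that lands, a quick sanity check from a build slave could look roughly like this (the bouncer URL and product string are illustrative, not the exact query the tests use): ask bouncer for a download and confirm the redirect points at an internal mirror.

# Illustrative check only; URL parameters are made up.
wget -S --max-redirect=0 -O /dev/null \
  'https://download.mozilla.org/?product=firefox-latest&os=linux64&lang=en-US' 2>&1 \
  | grep -i 'Location:'
# the Location: header should point at an internal mirror, not an external node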
Depends on: 646076
Assignee: server-ops-releng → zandr
Depends on: 656916
With bug 646076 resolved we can go ahead with this:
> Please turn off external network access for the following machines:
> moz2-linux-slave03, 04, 10, 17, 51
> mv-moz2-linux-ix-slave01
> moz2-linux64-slave07, 10
> moz2-darwin9-slave03, 08, 68, 010
> moz2-darwin10-slave01, 02, 03, 04, 010
> mw32-ix-slave01
> win32-slave03, 04, 010, 21, 60
> talos-r3-fed-001, 002, 010
> talos-r3-fed64-001, 002, 010
> talos-r3-leopard-001, 002, 010
> talos-r3-snow-001, 002, 010
> talos-r3-w7-001, 002, 003, 010
> talos-r3-xp-001, 002, 003, 010
> t-r3-w764-001, 002, 010



> But please allow them to access all Mozilla mirror nodes.

And of course, this is no longer necessary; these machines should be blocked from ALL external access.
I somehow missed bug 646056 until today. We're actually blocked on it (which is blocked on the DNS overhaul) before we can test this out in staging. Sorry for the back and forth.
Depends on: 646056
No longer depends on: 656916
We're finally ready for this, for realsies:
(In reply to comment #6)
> With bug 646076 resolved we can go ahead with this:
> > Please turn off external network access for the following machines:
> > moz2-linux-slave03, 04, 10, 17, 51
> > mv-moz2-linux-ix-slave01
> > moz2-linux64-slave07, 10
> > moz2-darwin9-slave03, 08, 68, 010
> > moz2-darwin10-slave01, 02, 03, 04, 010
> > mw32-ix-slave01
> > win32-slave03, 04, 010, 21, 60
> > talos-r3-fed-001, 002, 010
> > talos-r3-fed64-001, 002, 010
> > talos-r3-leopard-001, 002, 010
> > talos-r3-snow-001, 002, 010
> > talos-r3-w7-001, 002, 003, 010
> > talos-r3-xp-001, 002, 003, 010
> > t-r3-w764-001, 002, 010
Both of the currently dependent bugs are 95% done and don't block us from moving forward here, so I'm removing the dependency.
No longer depends on: 646056, 646076
Just to confirm before I lob this over to netops:

That list of machines is all staging/preproduction, so we do not need a downtime window for this change.

Put another way, if those machines are knocked out by this change, the tree stays open, yes?

Also, t-r3-w764-010 should not be on the list, as it's now talos-r3-fed-058
(In reply to comment #10)
> Just to confirm before I lob this over to netops:
> 
> That list of machines is all staging/preproduction, so we do not need a
> downtime window for this change.
> 
> Put another way, if those machines are knocked out by this change, the tree
> stays open, yes?

Correct.

> Also, t-r3-w764-010 should not be on the list, as it's now talos-r3-fed-058

K.
So, over to netops then.

Let's start with this set, because they're all in scl1.

> > win32-slave03, 04, 010, 21, 60
> > talos-r3-fed-001, 002, 010
> > talos-r3-fed64-001, 002, 010
> > talos-r3-leopard-001, 002, 010
> > talos-r3-snow-001, 002, 010
> > talos-r3-w7-001, 002, 003, 010
> > talos-r3-xp-001, 002, 003, 010
> > t-r3-w764-001, 002

These machines should no longer be allowed to connect out to the internet.
Assignee: zandr → network-operations
Component: Server Operations: RelEng → Server Operations: Netops
QA Contact: zandr → mrz
> > win32-slave03, 04, 010, 21, 60
are in SJC, so they should already be blocked.

Done for the others.

[root@talos-r3-fed64-001 ~]# wget google.fr
--2011-06-08 14:58:57--  http://google.fr/
Resolving google.fr... 74.125.115.99, 74.125.115.103, 74.125.115.104, ...
Connecting to google.fr|74.125.115.99|:80... ^C
Over to release engineering to make sure things are still working. Punt it back to Server Ops:Releng when you're happy that we haven't broken anything.
Assignee: network-operations → nobody
Component: Server Operations: Netops → Release Engineering
QA Contact: mrz → release
(In reply to comment #13)
> > > win32-slave03, 04, 010, 21, 60
> are in SJC, so it should already be blocked.

Is the implication here that build hosts in SJC already don't have internet access? I don't think that's the case...:
D:\mozilla-build\wget>hostname
win32-slave04

D:\mozilla-build\wget>wget google.fr
--05:13:19--  http://google.fr/
           => `index.html.1'
Resolving google.fr... 74.125.115.105, 74.125.115.106, 74.125.115.147, ...
Connecting to google.fr|74.125.115.105|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.google.fr/ [following]
--05:13:20--  http://www.google.fr/
           => `index.html.1'
Resolving www.google.fr... 74.125.115.105, 74.125.115.106, 74.125.115.147, ...
Reusing existing connection to google.fr:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [ <=>                                                           ] 9,475         --.--K/s

05:13:20 (68.09 MB/s) - `index.html.1' saved [9475]


Tossing this back to get these hosts dealt with.


> Done for the others.
> 
> [root@talos-r3-fed64-001 ~]# wget google.fr
> --2011-06-08 14:58:57--  http://google.fr/
> Resolving google.fr... 74.125.115.99, 74.125.115.103, 74.125.115.104, ...
> Connecting to google.fr|74.125.115.99|:80... ^C

Thanks!
Assignee: nobody → network-operations
Component: Release Engineering → Server Operations: Netops
QA Contact: release → mrz
(In reply to comment #15)
> (In reply to comment #13)
> > > > win32-slave03, 04, 010, 21, 60
> > are in SJC, so it should already be blocked.
> 
> Is the implication here that build hosts in SJC already don't have internet
> access? I don't think that's the case...:

Perhaps, though I agree that's not the case. However, before I take up any more of netops' time: 

(In reply to comment #14)
> Over to release engineering to make sure things are still working. Punt it
> back to Server Ops:Releng when you're happy that we haven't broken anything.

I don't see that statement in comment 15.
Assignee: network-operations → nobody
Component: Server Operations: Netops → Release Engineering
QA Contact: mrz → release
To clarify, my intent was to do this for a single DC first. Grabbing the win32 VMs in sjc1 was a mistake. If talos looks OK with access shut off, we'll move on to shutting down access out of sjc1.
(In reply to comment #17)
> To clarify, my intent was to do this for a single DC first. grabbing the
> win32 VMs in sjc1 was a mistake. If talos looks OK with access shut off,
> we'll move on to shutting down access out of sjc1

Gotcha, sorry about the confusion.
Assignee: nobody → bhearsum
I'm satisfied with the testing on the SCL machines, can we do the rest of them now? I believe the remaining are:
> > > moz2-linux-slave03, 04, 10, 17, 51
> > > mv-moz2-linux-ix-slave01
> > > moz2-linux64-slave07, 10
> > > moz2-darwin9-slave03, 08, 68, 010
> > > moz2-darwin10-slave01, 02, 03, 04, 010
> > > mw32-ix-slave01
> > > win32-slave03, 04, 010, 21, 60



Additionally, is it possible to get a list of all the specific things these new rules have blocked over the next 6 hours or so? E.g., if a machine is still attempting to connect to microsoft.com or time.apple.com, I'd like to see the source & destination IPs.
Assignee: bhearsum → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Attached file Blocked traffic
A small portion of the blocked traffic for this rule.
The firewall logs are full of blocked NTP traffic to time.apple.com; it would be nice if you could change that, or I can allow traffic there.
(In reply to comment #20)
> Created attachment 538576 [details]
> Blocked traffic
> 
> A small portion of the blocked traffic for this rule.
> The firewall logs are full of blocked ntp traffic to time.apple.com, it
> would be nice if you could change that, or I can allow traffic to there.

All of the machines in scl1 capable of doing so should be using dhcp to get their ntp servers (and dhcp is handing out 10.12.75.10, 10.12.75.12, 10.2.71.5 right now)
(In reply to comment #21)
> (In reply to comment #20)
> > Created attachment 538576 [details]
> > Blocked traffic
> > 
> > A small portion of the blocked traffic for this rule.
> > The firewall logs are full of blocked ntp traffic to time.apple.com, it
> > would be nice if you could change that, or I can allow traffic to there.
> 
> All of the machines in scl1 capable of doing so should be using dhcp to get
> their ntp servers (and dhcp is handing out 10.12.75.10, 10.12.75.12,
> 10.2.71.5 right now)

Mac and Windows actually don't know how to obey DHCP-provided ntp servers. Over in bug 646056 I rolled out changes that should've got all machines syncing to ntp.build.mozilla.org, instead.
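
For reference, a minimal sketch of what that amounts to on the Linux/Mac side (illustrative only; the actual files are Puppet-managed via bug 646056), plus a quick way to confirm a host really is syncing internally:

# /etc/ntp.conf, boiled down (illustrative; the real file is puppet-managed):
#   server ntp.build.mozilla.org iburst
#   driftfile /var/lib/ntp/drift
# then check which peer the daemon is actually using:
ntpq -p
# the "remote" column should show ntp.build.mozilla.org (or its address),
# not time.apple.com or another external pool host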

The strange part is that the six hosts listed in the log are:
10.12.49.190: talos-r3-fed64-010.build.scl1.mozilla.com.
10.12.50.111: talos-r3-xp-003.build.scl1.mozilla.com.
10.12.50.118: talos-r3-xp-010.build.scl1.mozilla.com.
10.12.50.163: talos-r3-w7-002.build.scl1.mozilla.com.
10.12.50.171: talos-r3-w7-010.build.scl1.mozilla.com.
10.12.50.55: talos-r3-snow-002.build.scl1.mozilla.com.

snow-002 hasn't synced with Puppet yet, so I can understand why it would still be talking to time.apple.com. But the others aren't even running OS X, and are in fact configured to talk to ntp.build.mozilla.org. Unless the minis have some sort of hardware-level NTP client, I'm baffled.
As for the other two hosts in that log, they are:
144.50.95.208.in-addr.arpa domain name pointer mirror.liberty.edu.
65.182.224.39 (doesn't resolve, but appears to be a PHP.net mirror)

I'm not concerned about blocking either of them. The former is a CentOS/Fedora mirror and was likely hit because someone ran 'yum search' or something. Not sure about the latter, but I can't find any references to php.net in mxr (that aren't comments), so it's doubtful anything bad is happening.
(In reply to comment #22)
> The strange part is that the six hosts listed in the log are:
> 10.12.49.190: talos-r3-fed64-010.build.scl1.mozilla.com.

Actually, this one only accessed the fedora mirror and php.net, so maybe there's something that gets installed on Windows through Boot Camp that mucks with time. Still looking into it.

> 10.12.50.111: talos-r3-xp-003.build.scl1.mozilla.com.
> 10.12.50.118: talos-r3-xp-010.build.scl1.mozilla.com.
> 10.12.50.163: talos-r3-w7-002.build.scl1.mozilla.com.
> 10.12.50.171: talos-r3-w7-010.build.scl1.mozilla.com.
> 10.12.50.55: talos-r3-snow-002.build.scl1.mozilla.com.
(In reply to comment #24)
> Actually, this one only accessed the fedora mirror and php.net, so maybe
> there's something that gets installed on Windows through Boot Camp that
> mucks with time. Still looking into it.

Indeed, there's an AppleTimeSrv, which claims to keep time in sync when booting back and forth between OS X and Windows. I'll track disabling it or otherwise fixing it to use our NTP server in bug 646056.
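
In case it helps, a rough sketch of what disabling it could look like on the Windows side (assuming the service really is registered as AppleTimeSrv; the actual fix will be worked out in bug 646056):

REM hypothetical sketch; the service name AppleTimeSrv is an assumption
sc query AppleTimeSrv

REM stop it and keep it from starting at boot
sc stop AppleTimeSrv
sc config AppleTimeSrv start= disabled

REM point the built-in Windows time service at the internal NTP host instead
w32tm /config /manualpeerlist:ntp.build.mozilla.org /syncfromflags:manual /update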

Thanks for pulling this log, it's super helpful!
As noted in https://bugzilla.mozilla.org/show_bug.cgi?id=650436#c4, this breaks Windows Activation. The current solution is to get the machines activated on the cage WiFi network, because telephone activation sucks.
AIUI, the alternatives for Windows activation are:
 * Network access to activation servers from build network
 * WiFi for activation
 * Dedicated switchport + long ethernet cable for activation

Others? Which is preferred?
(In reply to comment #27)
> AIUI, the alternatives for windows activation are:
>  * Network access to activation servers from build network

Non-trivial set of hosts: http://www.oleksiygayda.com/2010/08/how-to-windows-activation-firewall.html

>  * WiFi for activation

This seems to work OK, but it's yet more manual configuration to enable and then remember to disable the wireless connection.

>  * Dedicated switchport + long ethernet cable for activation

Which leaves us with only being able to activate one host at a time, and it needs to be a different long cable than the one I'm using for imaging (or I need to turn up deploystudio on vlan75 as well, which is not my first choice).

> Others? Which is preferred?

WiFi, so far.

http://technet.microsoft.com/en-us/library/ff793432.aspx is interesting reading but may not apply to our situation.
It looks like the MTV1 hosts were missed, specifically:
> > > > mv-moz2-linux-ix-slave01
> > > > moz2-darwin10-slave01, 02

I'm able to ping external hosts from these machines.
(In reply to comment #29)
> It looks like the MTV1 hosts were missed, specifically:
> > > > > mv-moz2-linux-ix-slave01
> > > > > moz2-darwin10-slave01, 02
> 
> I'm able to ping external hosts from these machines.

*un*able, sorry.
(In reply to comment #30)
> (In reply to comment #29)
> > It looks like the MTV1 hosts were missed, specifically:
> > > > > > mv-moz2-linux-ix-slave01
> > > > > > moz2-darwin10-slave01, 02
> > 
> > I'm able to ping external hosts from these machines.
> 
> *un*able, sorry.

No, no, my original comment was correct. Sorry for the churn here :(.
(In reply to comment #31)
> (In reply to comment #30)
> > (In reply to comment #29)
> > > It looks like the MTV1 hosts were missed, specifically:
> > > > > > > mv-moz2-linux-ix-slave01
> > > > > > > moz2-darwin10-slave01, 02
> > > 
> > > I'm able to ping external hosts from these machines.
> > 
> > *un*able, sorry.
> 
> No, no, my original comment was correct. Sorry for the churn here :(.

These are in mtv1. Again, I want to do this one DC at a time, and for mtv1 I want to do it as part of the new firewalls, so we're a little bit away from being ready to go.
So the current state is:
 scl1: staging machines blocked
 sjc1: staging machines open
 mtv1: staging machines open

From comment 32, there are new firewalls on the way for mtv1, so it's best to lump this change in with their turn-up; the immediate next step is the staging machines in sjc1.
Assignee: server-ops-releng → zandr
(In reply to comment #28) 
> > Others? Which is preferred?
> 
> WiFi, so far.
> 
> http://technet.microsoft.com/en-us/library/ff793432.aspx is interesting
> reading but may not apply to our situation.

Is it worth filing a separate IT bug to explore the various options here? What's our path forward?
I opened bug 667045 for the purpose.
(In reply to comment #28)
> (In reply to comment #27)
> > AIUI, the alternatives for windows activation are:
> >  * Network access to activation servers from build network
> 
> Non-trivial set of hosts:
> http://www.oleksiygayda.com/2010/08/how-to-windows-activation-firewall.html

I only see 9 hosts here. Does this list change frequently? Is there something else I'm missing about why we can't just whitelist 9 machines and be done?
Wrong bug - copied to bug 667045.
(In reply to comment #33)
> So the current state is:
>  scl1: staging machines blocked
>  sjc1: staging machines open
>  mtv1: staging machines open
> 
> From comment 32, there are new firewalls on the way for mtv1 so it's best to
> lump this change in with their turn-up, so the immediate next step is
> staging machines in sjc1.

I was corrected on this point the day I wrote it, but didn't update the bug. The firewalls in sjc1 are already overloaded, so we would need new firewalls to implement blocking there. However, sjc1 is quite the rat's nest at a variety of levels, so that DC will not be getting new firewalls, and thus will not have outgoing access blocked.

So this project is currently waiting on
 (a) bouncer fixes (bug 646076)
 (b) new firewalls in mtv1 (part of bug 649422)

(a) is blocked on PTO for the moment.  I'll get an update on (b). Note that we've been leaning on an already-overloaded netops for higher-priority items on the infra list for a while now.
(status description copied to bug 617414 - sorry, this was the wrong place for it)
Let's see how things look after the P2P link is enabled.
Wrong bug
Blocks: 498425
I have no idea why this bug is still open, and I got lost going through the 42 comments. Help?
Assignee: zandr → mrz
This bug is still open because we only got partway through disabling external access for staging/preproduction slaves, primarily because bug 646076 turned out not to be possible. I'm going to say that this bug has probably outlived its usefulness, though. We can file specific follow-up bugs to get access disabled for other machines when we're ready.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INCOMPLETE
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations