Closed Bug 617414 Opened 9 years ago Closed 3 years ago

[Tracking bug] prevent RelEng systems contacting the outside world (block external network access)

Categories: Release Engineering :: General (defect)
Platform: x86_64 Linux
Priority: Not set
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: Reporter: bhearsum; Unassigned
References: Depends on 1 open bug; Blocks 1 open bug
Keywords: sheriffing-P1
Attachments: 2 files

...to help weed out tests that depend on the network.

Not sure exactly when we want to do this, but probably somewhere between Christmas and New Year's, and likely for half a day or so. This needs to be coordinated with developers who will be around to file/debug issues that come up. Cc'ing some potential people.
(In reply to comment #0)
> Not sure exactly when we want to do this, but probably somewhere between
> Christmas and New Year's, and likely for half a day or so. This needs to be
> coordinated with developers who will be around to file/debug issues that come
> up. Cc'ing some potential people.
I can likely help with this during that time.
would we get the same results if a developer unplugged their networking cable from a computer and tried to run the tests?
I wouldn't be surprised if we hit mozilla.org or .com favicon or something, so I'd interpret shutting off access to mean everything except the buildbot master, stage.m.o, graphs.m.o, and anything else I forget that we absolutely need to complete a job.
(In reply to comment #2)
> would we get the same results if a developer unplugged their networking cable
> from a computer and tried to run the tests?

This would likely be helpful, but I still think we should do this.
Looping in the netops guys to see what makes sense from the networking side.
As long as the hosts are within SJC1 this is fairly simple to accomplish by adding a few policies on the firewall there.

Can we get a time when this is to be applied and for how long?
Initially, we would be looking at doing it for only a few hours -- say 4. The hosts in question are mostly in SCL, possibly a few left in MTV, but we could probably disable those for the trial period.

Eventually, we'd be looking at shutting off external network access to them permanently.
(In reply to comment #6)
> As long as the hosts are within SJC1 this is fairly simple to accomplish by
> adding a few policies on the firewall there.
> 
> Can we get a time when this is to be applied and for how long?

would it be possible to get some logging turned on right now to see what these machines are hitting that's outside our network?
Ravi talked to me about this a few days ago and came up with the idea of just turning on logging for all external requests made from within the build network. This seems like a _much_ less disruptive option, and should get us a bunch of good data. I'm waiting for him to turn that on, and then we'll find a way to publish the data for all to look at.
Summary: shut off network access to build machines for a period of time → log external requests made by machines in the build network
Depends on: 625158
(In reply to comment #9)
> Ravi talked to me about this a few days ago and came up with the idea of just
> turning on logging for all external requests made from within the build
> network. This seems like a _much_ less disruptive option, and should get us a
> bunch of good data. I'm waiting for him to turn that on, and then we'll find a
> way to publish the data for all to look at.



After talking with bhearsum, to avoid confusion I'm morphing bug#617414 to track this project. Dependent bug#625158 is with IT to track the logging of outbound traffic. Once we get logs, we'll file dependent bugs to track fixing tests and constructing a whitelist of machines we need to continue to reach... and finally a dependent bug for IT to actually close the firewall.
OS: Linux → All
Summary: log external requests made by machines in the build network → [Tracking bug] prevent RelEng systems contacting the outside world except for whitelisted machines
Depends on: 626999
Priority: -- → P3
Hm.
I'm in Haxxor, connected to the build wifi, accessing wiki.m.o; will that show up as a releng system accessing the outside world?

What if I have to hit developer.nokia.com?
(In reply to comment #11)
> Hm.
> I'm in Haxxor, connected to the build wifi, accessing wiki.m.o; will that show
> up as a releng system accessing the outside world?
> 
> What if I have to hit developer.nokia.com?
We'll add developer.nokia.com to the whitelist... or am I missing something?
Should be fine if we do that.
What is the use case here?  What is the client in this context?  Devices being tested or humans?  If the latter, I'd be more inclined to put a squid proxy in place, with manual client-side configuration to use it while in Haxxor.  My objective is to not have to maintain a whitelist on the FW, but rather permit a single IP (the proxy) which allows anything on the build network.
We want to block all devices being tested from accessing the outside world (outside of a whitelist, to be whittled down to 0).

I was concerned that it would affect humans as well.
Seems to me that the releng machines should be totally cut off, and if you're a Real Human doing something, you should connect to another network, VPN, or move files through an intermediate.

Since we've been bitten by having people reference external resources from test machines, there ideally shouldn't be any whitelist unless absolutely necessary (eg uploading/downloading builds, or something like that).
(In reply to comment #16)
> Seems to me that the releng machines should be totally cut off, and if you're a
> Real Human doing something, you should connect to another network, VPN, or
> move files through an intermediate.
> 
> Since we've been bitten by having people reference external resources from test
> machines, there ideally shouldn't be any whitelist unless absolutely necessary
> (eg uploading/downloading builds, or something like that).

+1. 

Dolske, exactly right - once we have all the tests identified and cleaned up, we can close off access without burning the tree. See dep bugs for details. For extra goodness, this should resolve a few intermittent orange tests too, I suspect.
(In reply to comment #14)
> What is the use case here?  What is the client in this context?  Devices being
> tested or humans?  

A couple of different cases. Dolske nailed it that the test machines should be able to talk to few hosts if any. However, humans also work in that room, and we should probably make a different SSID available for that purpose (and VPN back into the releng network if you need to)

> My objective is to not have to maintain a whitelist on the FW, but
> rather permit a single IP (proxy) which allows anything on the build network.

If the humans are on a different SSID, then we should be able to keep the whitelist fairly static. That said, I'm not averse to a proxy just to move management of that whitelist out of netops and the firewall.
Per today's group meeting, bhearsum will go through the new logs provided in bug#625158, to look for what still wants to contact the outside world. 

If any RelEng systems are doing this, we'll track fixing in depbugs of bug#617414.
If any of the tests are doing this, bhearsum will file depbugs under bug#626999. 


Once all these are resolved, we'll file 2 bugs with IT:

1) do one more logging run of the firewall, to make sure nothing new happening since the last check.
2) deny connections from RelEng systems to the outside world, except of course to a well-defined whitelist.
I started looking through the latest logs from bug 625158. I've verified that a bunch of them are blockable; I still need to go through some others. The other good news is that I haven't found anything in these logs indicating that regular builds/tests access external sites. There's one reference to hans-moleman.w3.org, but I'd expect more if tests were requesting it.

There's an open question about what to do with Mozilla mirrors, which we validly access as part of the Final Verification tests. We've been talking about ways to get rid of external mirror access for those, though, and that may work around the issue.

There's also an open question about all of the 63.245.* IPs, which are Mozilla owned as far as I can tell, but don't correspond to any real services. I think they are various load balancers and/or proxies. Not sure what the consequences of blocking those are.

I'm using https://spreadsheets.google.com/ccc?key=0AhwotqoBm6lEdDU3cjc4cmQ5SjdQWjVRRVo4Sjk3enc&hl=en as a tracking place for these, if anyone else wants to go through them.
Assignee: bhearsum → nobody
OS: All → Linux
Priority: P3 → --
(In reply to comment #20)
> I started looking through the latest logs from bug 625158. I've verified that a
> bunch of them are blockable, still need to go through some others. 
Leaving with bhearsum while he finishes that.

> The other
> good news is that I haven't found anything in these logs that indicate regular
> builds/tests access external sites. There's one reference to
> hans-moleman.w3.org, but I'd expect more if it was tests requesting them.

Maybe a developer who had been granted temp access to a RelEng machine was experimenting? How long ago was that?

 
> There's an open question about what to do with Mozilla mirrors, which we
> validly access as part of the Final Verification tests. We've been talking
> about ways to get rid of external mirror access for those, though, and that may
> work around the issue.

We can also whitelist those if we have a complete list.


> There's also an open question about all of the 63.245.* IPs, which are Mozilla
> owned as far as I can tell, but don't correspond to any real services. I think
> they are various load balancers and/or proxies. Not sure what the consequences
> of blocking those are.
If appropriate, we can whitelist these too.


zandr: Can IT identify these machines for us? 


> I'm using
> https://spreadsheets.google.com/ccc?key=0AhwotqoBm6lEdDU3cjc4cmQ5SjdQWjVRRVo4Sjk3enc&hl=en
> as a tracking place for these, if anyone else wants to go through them.
Assignee: nobody → bhearsum
(In reply to comment #21)
> zandr: Can IT identify these machines for us? 

https://spreadsheets.google.com/ccc?key=tCWzYw2MvtvS5ag57J7lflA#gid=0

I had to fork bhearsum's spreadsheet because I didn't have write permission.
(In reply to comment #21)
> (In reply to comment #20)
> > I started looking through the latest logs from bug 625158. I've verified that a
> > bunch of them are blockable, still need to go through some others. 
> Leaving with bhearsum while he finishes that.
> 
> > The other
> > good news is that I haven't found anything in these logs that indicate regular
> > builds/tests access external sites. There's one reference to
> > hans-moleman.w3.org, but I'd expect more if it was tests requesting them.
> 
> Maybe a developer who had been granted temp access to a RelEng machine was
> experimenting? How long ago was that?
> 
> 
> > There's an open question about what to do with Mozilla mirrors, which we
> > validly access as part of the Final Verification tests. We've been talking
> > about ways to get rid of external mirror access for those, though, and that may
> > work around the issue.
> 
> We can also whitelist those if we have a complete list.
> 
> 
> > There's also an open question about all of the 63.245.* IPs, which are Mozilla
> > owned as far as I can tell, but don't correspond to any real services. I think
> > they are various load balancers and/or proxies. Not sure what the consequences
> > of blocking those are.
> If appropriate, we can whitelist these too.
> 
> 
> zandr: Can IT identify these machines for us? 

The question wasn't so much identification, but rather whether or not we're planning to block external Mozilla IPs in general. However, I am surprised to see us accessing FTP and hg by their external IPs.
The last of the blocking items from bug 626999 is now fixed. The only bug left open is 628873, which doesn't need to be fixed before we shut off network access. So, I believe we're at a point now where we won't break any tests by shutting things off.
Just looked over the traffic log at the Mozilla IPs. The ones with a lot of hits are:
- 63.245.209.115/443 (services.addons.mozilla.org)
- 63.245.209.158/443 (AMO)
- 63.245.209.114/443 (getpersonas.com)
- 63.245.209.91/443 (AMO)
- 63.245.209.11/21 (mozilla.org)

Need to track down what part of our infra is accessing these.
(In reply to comment #25)
> Just looked over the traffic log at the Mozilla IPs. The ones with a lot of
> hits are:
> - 63.245.209.115/443 (services.addons.mozilla.org)

This is probably accessed in the background during Talos runs. Unittests fake it out with prefs, so it's unlikely to be happening there, too. I think we can safely block it.

> - 63.245.209.114/443 (getpersonas.com)

I couldn't find references to this in MXR, but we probably access it like the above via a different URL.

> - 63.245.209.158/443 (AMO)
> - 63.245.209.91/443 (AMO)

These are probably more instances of the addons manager running in the background.

> - 63.245.209.11/21 (mozilla.org)

Not sure about this one, maybe from first run pages or similar?
So, I think the analyzing part of this is done. We still need to:
- List out the machines that need to be whitelisted completely
- List out the external things that need whitelisting
- Figure out when we're flipping the switch.
Proposed plan:
Start blocking external services sometime during the All Hands (first week of April, exact day/time TBD based on the schedule).

Whitelist the following machines completely:
- *puppet*
- admin
- cruncher
- *master*
- *foopy*
- bm-remote-talos-webhost*
- slavealloc
- *opsi*
- *stage*

For the rest of the machines, whitelist all Mozilla mirror nodes.
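The glob-style patterns in the proposed whitelist could be evaluated against hostnames roughly like this (a sketch only; `is_whitelisted` and the comparison against the unqualified short name are my assumptions, not actual RelEng deployment logic):

```python
from fnmatch import fnmatch

# Patterns copied from the proposed whitelist above (glob-style).
WHITELIST = ['*puppet*', 'admin', 'cruncher', '*master*', '*foopy*',
             'bm-remote-talos-webhost*', 'slavealloc', '*opsi*', '*stage*']

def is_whitelisted(hostname):
    """Return True if the unqualified hostname matches any whitelist pattern."""
    short = hostname.split('.')[0]  # strip the domain before matching
    return any(fnmatch(short, pattern) for pattern in WHITELIST)
```

For example, `is_whitelisted('releng-puppet1.build.scl1.mozilla.com')` matches `*puppet*`, while a test slave like `talos-r3-fed64-001` matches nothing.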
Depends on: 646046
Depends on: 646056
Depends on: 646076
Things are moving forward in the dependent bugs again, here's the plan:
- Fix newly found tests that depend on the network (bugs 663211 and 663372), rerun tests afterwards.
- Run some test builds/release jobs once NetOps switches off internet access on staging machines in bug 646046.
- Analyse firewall logs also requested in bug 646046.

Once all the above is done, and there's no more test or build failures, we're ready to make the switch in production.

Also, updating the summary because we're not doing any whitelisting.
Summary: [Tracking bug] prevent RelEng systems contacting the outside world except for whitelisted machines → [Tracking bug] prevent RelEng systems contacting the outside world
bhearsum, no Internet access is great news.
(In reply to comment #29)
> Things are moving forward in the dependent bugs again, here's the plan:
> - Fix newly found tests that depend on the network (bugs 663211 and 663372),
> rerun tests afterwards.

bug 663211 is still waiting on a review, and checkin. However, that bug doesn't cause failed tests, so we don't need to block on it. bug 663372 is fixed now, and that test suite is fully green.

> - Run some test builds/release jobs once NetOps switches off internet access
> on staging machines in bug 646046.

I'm satisfied with the shut-off for the test machines, so I've requested we shut things off for the build machines now, too. I'll be doing these tests next week.

> - Analyse firewall logs also requested in bug 646046.

The only thing of note in those logs was a bunch of hits to time.apple.com. I've fixed the machines doing that, details are in bug 646056.
Actually, the wiki page only indicates that it's *related* to bug 649422.
This project is currently waiting on
 (a) bouncer fixes (bug 646076)
 (b) new firewalls in mtv1 (part of bug 649422)

(a) is blocked on PTO for the moment.  I'll get an update on (b). Note that we've been leaning on an already-overloaded netops for higher-priority items on the infra list for a while now.
(In reply to comment #34)
> This project is currently waiting on
>  (a) bouncer fixes (bug 646076)
>  (b) new firewalls in mtv1 (part of bug 649422)
> 
> (a) is blocked on PTO for the moment.  I'll get an update on (b). Note that
> we've been leaning on an already-overloaded netops for higher-priority items
> on the infra list for a while now.

And per IRC, we _cannot_ do this at all in SJC1, because the firewalls are overloaded:
<dustin> so it looks like sjc1 is in the same position as mtv1: needs new firewalls before we can put on more ACL's.  However, sjc1 is full and slated to be evacuated once scl3 is online, so there's no plan to add more firewalls there.

So, given that we're going to be inconsistent between locations _anyways_, once a) and b) are addressed I'm going to ask for the network to be shut off to the SCL1 build network. We'll flip the switch on MTV1 at a later date, when we can. And we'll figure out what to do with the currently-in-SJC1 machines when we know when they're moving (maybe we'll just shut off their access from the get-go).
(In reply to comment #35)
> (In reply to comment #34)
> > This project is currently waiting on
> >  (a) bouncer fixes (bug 646076)
> >  (b) new firewalls in mtv1 (part of bug 649422)
> > 
> > (a) is blocked on PTO for the moment.  I'll get an update on (b). Note that
> > we've been leaning on an already-overloaded netops for higher-priority items
> > on the infra list for a while now.
> 
> So, given that we're going to be inconsistent between locations _anyways_,
> once a) and b) are addressed I'm going to ask for the network to be shut off
> to the SCL1 build network.

I'm not sure why I said we're going to wait on MTV1 firewalls before flipping the switch on SCL1. That's not the case - we're just waiting on bug 646076 for that.
Depends on: 696133
Just discovered another thing that uses the external network: weekly add-on tests. They download add-ons from places like https://addons.mozilla.org/en-US/firefox/downloads/latest/722. Filed bug 696133 on it.
At ravi's prompting, I just had another look at outgoing sessions.  Since this project is stalled, I won't go into too much detail, but a few big points of note:

 *  "final verification" for a release involves HEADing all of the Firefox mirrors, which is a quickly-changing list

 *  there's a lot of outbound NTP traffic, probably from default host configs

 *  much of the traffic is to ftp.m.o, which resolves to an external address in the build view

 *  many of the sessions are to pulse, which is also external

I was surprised to hear that there was so much external traffic (40Mbps sustained, very spiky).  I had not counted on the latter two points, but they explain much of the load.  We could change the FTP path by changing how the name resolves; pulse is currently external-only, but we could conceivably multi-home it when the new hardware is installed.  I don't know how either choice would affect bandwidth consumption.
I'm surprised by the traffic to ftp.m.o, as I thought we were using stage.m.o for most things, and that does have an internal address. Have you got any examples handy?
[cltbld@linux64-ix-slave10 ~]$ host stage.mozilla.org
stage.mozilla.org is an alias for surf.mozilla.org.
surf.mozilla.org has address 10.2.74.116

[cltbld@linux64-ix-slave10 ~]$ host ftp.mozilla.org
ftp.mozilla.org is an alias for ftp.generic-zlb.sj.mozilla.com.
ftp.generic-zlb.sj.mozilla.com has address 63.245.209.137

I see traffic to ftp from both *-ix-* and talos-r3-* hosts.  Here are three example flows:

Nov 16 20:57:09 10.12.48.14/54127->63.245.209.137/80 63.245.222.66/41440->63.245.209.137/80
Nov 16 20:57:09 10.12.48.14/54125->63.245.209.137/80 63.245.222.66/63596->63.245.209.137/80
Nov 16 20:57:09 10.12.48.14/54126->63.245.209.137/80 63.245.222.66/9835->63.245.209.137/80

(these are SNAT'd - 63.245.222.66 is the public IP selected for the session)
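This is why stage traffic never shows up in the external logs while ftp traffic does: stage resolves to RFC1918 private space, ftp to a public address. A minimal check (the helper name is mine, not RelEng tooling):

```python
import ipaddress

def crosses_firewall(dst_ip):
    """True if the destination address is outside RFC1918 private space,
    i.e. the session would be SNAT'd out and show up in the external
    traffic logs."""
    return not ipaddress.ip_address(dst_ip).is_private

crosses_firewall('10.2.74.116')     # stage/surf: internal, False
crosses_firewall('63.245.209.137')  # ftp zlb: external, True
```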
As for all those requests, may I suggest entirely separating the (release) build machines from the other release-related machines that handle tasks like running unit tests, uploads, and mirror checks?

Security-wise, what's critical is:
1. getting and verifying the source code
2. building (i.e. compile and get tar.bz2 / installer exe etc.)
3. signing that build (PGP and SSL signature for Windows)

Once the build is signed and won't be modified anymore, it's no longer security-sensitive, because anybody can check its integrity. It's only these 3 steps above where somebody can modify things without detection.

All the other tasks, *especially* running the test cases, are more involved and a lot more prone to attack. Tests run a *lot* of code (in fact, that's their very purpose), and there are often tests which, intentionally or not, trigger network access.
Likewise, mirror verification inherently involves communicating with a lot of foreign hosts.

I realize that this would be quite a dramatic change, to not let build machines run unit tests. Please note that I propose this change only for releases (and nightlies, aurora, beta), not necessarily for tinderbox continuous builds.

But despite being dramatic, I think it's a necessary change.
BenB: part of the desire for restricting outside network access is to prevent (or minimize) external factors from affecting test results. Over the years we've spent quite a bit of time chasing down test failures that turned out to be due to things like external web sites changing, or DNS of a remote domain being misconfigured. By preventing outbound access, these tests will _never_ pass on our infrastructure and so will need to be rewritten.

The security issues are a nice-to-have, but I think separate from this particular bug.
Comment 0 clearly states tests. Sorry, I was confused.

> The security issues are a nice-to-have, but I think separate from this particular bug.

OK. Will use another bug, then.
Depends on: 709531
Just found bug 709531, which talks about Addons Manager tests that depend on AMO. Based on the current DNS, I'm pretty sure fixing those blocks this bug :(
[cltbld@mv-moz2-linux-ix-slave06 ~]$ host addons.mozilla.org
addons.mozilla.org has address 63.245.217.112
(In reply to Ben Hearsum [:bhearsum] from comment #44)
> Just found bug 709531, which talks about Addons Manager tests that depend on
> AMO. Based on the current DNS, I'm pretty sure fixing those blocks this bug

These tests got disabled, so we don't have to wait for them to be fixed!
No longer depends on: 709531
Depends on: 745206
Depends on: 745299
Blocks: 721856
Working from the latest captures done in bug 745206, I have done the following:
- Morphed the data into a useful data format: nested Python dictionaries, saved in a pickle file
- Resolved most of the hostnames (both source and dest). There were a few problems parsing some output or resolving some machines, resulting in NXDOMAIN or other odd output, but most of them worked out OK.
- Counted the number of src+dest+port combinations

Out of those, I filtered out the following:
ignorehosts = [
    '.*infra.scl1.mozilla.com',
    'dc01.winbuild.scl1.mozilla.com',
    'eval-mini-1.build.scl1.mozilla.com',
    'ganglia1.build.scl1.mozilla.com',
    'hg1.build.scl1.mozilla.com',
    'install2.build.scl1.mozilla.com',
    'pxe1.build.scl1.mozilla.com',
    'rabbit1-dev.build.scl1.mozilla.com',
    'dm-nagios01.mozilla.org',
    'ganglia1.dmz.scl3.mozilla.com',
    'arr-client-test.build.scl1.mozilla.com',
    'bm-vpn01.build.sjc1.mozilla.com',
    'dm-wwwbuild01.mozilla.org',
    'mvadmn01.mv.mozilla.com',
    'r5-mini-',
]

Most of those are infra machines. r5-mini-* are staging machines, so they'll have tons of random requests that don't happen as part of automation.

I also did a bunch of grouping to make the dataset smaller:
def group_host(host):  # wrapped as a complete function here; the name is illustrative. Needs `import re`.
    if host.startswith('talos-r3-fed64'):
        return host[0:14]
    elif host.startswith('talos-r3-fed'):
        return host[0:12]
    elif host.startswith('talos-r3-leopard'):
        return host[0:16]
    elif host.startswith('talos-r4-lion') or host.startswith('talos-r4-snow'):
        return host[0:13]
    elif host.startswith('talos'):
        return host[0:11]
    elif host.startswith('t-r3-w764'):
        return u't-r3-w764'
    elif host.startswith('w64-ix-slave'):
        return u'w64-ix-slave'
    elif host.startswith('w32-ix-slave'):
        return u'w32-ix-slave'
    elif host.startswith('bld-lion-r5'):
        return u'bld-lion-r5'
    elif host.startswith('buildbot-master'):
        return u'buildbot-master'
    elif host.startswith('linux64-ix-slave'):
        return u'linux64-ix-slave'
    elif host.startswith('linux-ix-slave'):
        return u'linux-ix-slave'
    elif 'deploy.akamaitechnologies.com' in host:
        return u'deploy.akamaitechnologies.com'
    elif host.endswith('1e100.net'):
        return u'1e100.net'
    elif host.endswith('sjc.llnw.net'):
        return u'llnw.net'
    elif re.match(r'time.*\.apple\.com', host):
        return u'time.apple.com'
    return host  # everything else is left ungrouped

Lastly, I filtered out anything with fewer than 10 requests.

Attached is the resulting dataset in the format:
(srchost, dsthost, port, # of requests)

I will comment inline with things of concern.
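The ignore/count/threshold steps described above can be sketched like this (a rough reconstruction under assumed names; the actual script isn't attached, and the ignore list is abridged to two patterns):

```python
import re
from collections import Counter

# Two patterns copied from the ignorehosts list above (abridged).
IGNORE = [re.compile(p) for p in (r'.*infra\.scl1\.mozilla\.com', r'r5-mini-')]

def summarize(records, min_requests=10):
    """records: iterable of (srchost, dsthost, port) tuples.
    Drops combinations involving ignored hosts, counts the rest,
    and keeps only combinations seen at least min_requests times."""
    counts = Counter(
        r for r in records
        if not any(p.match(h) for p in IGNORE for h in (r[0], r[1]))
    )
    return [(src, dst, port, n) for (src, dst, port), n in counts.items()
            if n >= min_requests]
```

For example, flows from an r5-mini-* staging machine are dropped entirely, and any src+dest+port combination with fewer than 10 requests is filtered out.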
Depends on: 764500
Depends on: 764501
Depends on: 764504
Depends on: 764505
Depends on: 764507
Depends on: 764509
Comment on attachment 632761 [details]
scl1 data, grouped, filtered, counted

>(u'3(NXDOMAIN)', u'xmlrpc.rhn.redhat.com', '443', 349)
>(u'releng-puppet1.build.scl1.mozilla.com', u'xmlrpc.rhn.redhat.com', '443', 185)
>(u'slavealloc.build.scl1.mozilla.com', u'xmlrpc.rhn.redhat.com', '443', 51)

These look like machines in the RelEng network that are running RHEL. I don't know if they need to have access to this server to function correctly. Dustin/Amy?

>(u'bld-lion-r5', u'admin1a.build.scl1.mozilla.com', '123', 310)
>(u'bld-lion-r5', u'admin1b.build.scl1.mozilla.com', '123', 385)
>(u'buildbot-master', u'admin1a.build.scl1.mozilla.com', '123', 10)
>(u'buildbot-master', u'admin1b.build.scl1.mozilla.com', '123', 12)

This, and all of the other requests to admin1a/b are NTP. Given that the dst is inside the build network, I don't think we need to worry about this or other such requests. I won't be making further comment on them.

>(u'talos-r3-fed64', u'1e100.net', '443', 4750)
>(u'talos-r3-fed64', u'1e100.net', '80', 3135)

This article claims that Firefox safe browsing uses 1e100.net, so this is probably background requests to it: http://www.pcmech.com/article/the-mysterious-1e100-net/. Probably safe to block.

>(u'talos-r3-fed64', u'addons-star.zlb.phx.mozilla.net', '443', 12)
>(u'talos-r3-fed64', u'addons-versioncheck-bg.lb1.phx.mozilla.net', '443', 36)
>(u'talos-r3-fed64', u'addons.zlb.phx.mozilla.net', '443', 11)

This is probably Firefox contacting AMO during tests. Not sure if tests depend on this, or if it's just background checks. Probably ok to block though.

>(u'talos-r3-fed64', u'deploy.akamaitechnologies.com', '443', 207)

A million different things use Akamai; I don't know how to pinpoint which application is making these requests, but they should be safe to block in all cases.


BELOW THIS LINE ARE THINGS THAT WE NEED TO ADDRESS:
>(u'autoland-staging01.build.scl1.mozilla.com', u'worblehat.Math.HMC.Edu', '80', 12)
<snip>

Looks like autoland-staging01 is using external CentOS mirrors instead of mrepo. Needs to be fixed. bug 764500.

>(u'talos-r3-fed64', u'hg-zlb.vips.scl3.mozilla.com', '80', 56)
>(u'dev-master01.build.scl1.mozilla.com', u'hg-zlb.vips.scl3.mozilla.com', '80', 4484)

Looks like these machines have a different route to hg. May need to address this or whitelist it. No bug filed because I don't know what we should be doing.

>(u'master-puppet1.build.scl1.mozilla.com', u'sproweb03.spro.net.\nmirror01.spro.net', '80', 54)
>(u'scl-production-puppet.build.scl1.mozilla.com', u'linux.mirrors.es.net', '80', 50)

Looks like these machines need to be pointed at mrepo. bug 764501.

>(u'releng-mirror01.build.scl1.mozilla.com', u'isc-sfo2-01.mozilla.org', '873', 246)

This is releng-mirror01 rsyncing. It syncs from stage-rsync.mozilla.org, which turns out to be this machine:
➜  manifests  host stage-rsync.mozilla.org
stage-rsync.mozilla.org is an alias for pv-mirror01.mozilla.org.
pv-mirror01.mozilla.org is an alias for isc-sfo2-01.mozilla.org.

Will need to do something about this. bug 764504.

>(u'signing1.build.scl1.mozilla.com', u'notarization.verisign.net', '80', 7746)
>(u'signing2.build.scl1.mozilla.com', u'notarization.verisign.net', '80', 8107)

This is signcode contacting the timestamp server as part of signing. This will need to be whitelisted. May need to change our servers to use this name instead of timestamp.verisign.com. bug 764505.

>(u'talos-r3-fed64', u'ftp.generic-zlb.sj.mozilla.com', '80', 87)

This is likely downloading of builds from ftp. Needs to be whitelisted. bug 764507.

>(u'talos-r3-fed64', u'graphs.generic-zlb.sj.mozilla.com', '80', 59)

Graph server posts, need to be whitelisted. bug 764507.


>(u'talos-r3-fed64', u'huxley.baz.com', '80', 939)
>(u'talos-r3-fed64', u'mirrors.rpmfusion.org', '80', 276)
>(u'talos-r3-fed64', u'mirror.us.leaseweb.net', '80', 350)

Looks like some or all of these machines are using some external yum mirrors. bug 764509.



That's all from scl1, still need to do scl3.
Depends on: 764514
Depends on: 764521
Comment on attachment 632798 [details]
scl3 data grouped, filtered, counted

>(u'bld-lion-r5', u'3(NXDOMAIN)', '443', 26723)
>(u'bld-lion-r5', u'3(NXDOMAIN)', '80', 1010)
>(u'bld-lion-r5', u'aus4-dev.zlb.phx.mozilla.net', '443', 271)

This is Balrog updates being submitted; this needs to continue to work. bug 764514

>(u'bld-lion-r5', u'clamav-du.viaverio.com', '80', 48)
>(u'bld-lion-r5', u'clamav.inoc.net', '80', 22)
>(u'bld-lion-r5', u'clamav-mirror.co.ru', '80', 20)
>(u'bld-lion-r5', u'clamav-sj.viaverio.com', '80', 39)
>(u'bld-lion-r5', u'clamav.swishmail.com', '80', 34)
>(u'bld-lion-r5', u'clamav.theshell.com', '80', 41)
>(u'bld-lion-r5', u'clamav.us.es', '80', 32)

It appears that clam is installed on this class of machines. I have no idea why, but we don't use it, so this is safe to block.

>(u'bld-lion-r5', u'geotrust-ocsp-ilg.verisign.net', '80', 40)
>(u'bld-lion-r5', u'geotrust-ocsp-mtv.verisign.net', '80', 42)

I think these are CRL servers. Blockable.

>(u'bld-lion-r5', u'mirror.netcologne.de', '80', 19)
>(u'bld-lion-r5', u'mirror-vip.cs.utah.edu', '80', 47)

Probably hit as part of final verify, will be fixed bug 646076.

>(u'buildbot-master', u'dp-pulse01.pub.phx1.mozilla.com', '5672', 441)

Pulse publishing, needs to continue to work. bug 764521.
(In reply to Ben Hearsum [:bhearsum] from comment #47)
> This is probably Firefox contacting AMO during tests. Not sure if tests
> depend on this, or its just background checks. Probably ok to block though.

Yep, all the extension and safebrowsing tests are intentionally hitting something local, anything hitting AMO or 1e100.net is just in a harness where we never thought to make sure they didn't get hit.
(In reply to Ben Hearsum [:bhearsum] from comment #47)
> >(u'3(NXDOMAIN)', u'xmlrpc.rhn.redhat.com', '443', 349)
> >(u'releng-puppet1.build.scl1.mozilla.com', u'xmlrpc.rhn.redhat.com', '443', 185)
> >(u'slavealloc.build.scl1.mozilla.com', u'xmlrpc.rhn.redhat.com', '443', 51)
>
> These look like machines in the RelEng network that are running RHEL. I
> don't know if they need to have access to this server to function correctly.
> Dustin/Amy?

I believe so:

/etc/sysconfig/rhn/up2date:serverURL=https://xmlrpc.rhn.redhat.com/XMLRPC

This is something we may end up changing, but for the moment, yes, whitelist it.

(In reply to Ben Hearsum [:bhearsum] from comment #49)
> It appears that clam is installed on this class of machines. I have no idea
> why, but we don't use it, so this is safe to block.

That's the mac's built-in virus-checking trying to update itself, I bet.
Depends on: 766919
(In reply to Dustin J. Mitchell [:dustin] from comment #51)
> (In reply to Ben Hearsum [:bhearsum] from comment #47)
> > >(u'3(NXDOMAIN)', u'xmlrpc.rhn.redhat.com', '443', 349)
> > >(u'releng-puppet1.build.scl1.mozilla.com', u'xmlrpc.rhn.redhat.com', '443', 185)
> > >(u'slavealloc.build.scl1.mozilla.com', u'xmlrpc.rhn.redhat.com', '443', 51)
> >
> > These look like machines in the RelEng network that are running RHEL. I
> > don't know if they need to have access to this server to function correctly.
> > Dustin/Amy?
> 
> I believe so:
> 
> /etc/sysconfig/rhn/up2date:serverURL=https://xmlrpc.rhn.redhat.com/XMLRPC
> 
> This is something we may end up changing, but for the moment, yes, whitelist
> it.

Thanks!

> (In reply to Ben Hearsum [:bhearsum] from comment #49)
> > It appears that clam is installed on this class of machines. I have no idea
> > why, but we don't use it, so this is safe to block.
> 
> That's the mac's built-in virus-checking trying to update itself, I bet.

Ahhhh, okay. Definitely blockable then.
(In reply to Dustin J. Mitchell [:dustin] from comment #51)
> (In reply to Ben Hearsum [:bhearsum] from comment #47)
> > >(u'3(NXDOMAIN)', u'xmlrpc.rhn.redhat.com', '443', 349)
> > >(u'releng-puppet1.build.scl1.mozilla.com', u'xmlrpc.rhn.redhat.com', '443', 185)
> > >(u'slavealloc.build.scl1.mozilla.com', u'xmlrpc.rhn.redhat.com', '443', 51)
> >
> > These look like machines in the RelEng network that are running RHEL. I
> > don't know if they need to have access to this server to function correctly.
> > Dustin/Amy?
> 
> I believe so:
> 
> /etc/sysconfig/rhn/up2date:serverURL=https://xmlrpc.rhn.redhat.com/XMLRPC
> 
> This is something we may end up changing, but for the moment, yes, whitelist
> it.

Since I just saw an e-mail pointing here: this one is identified in bug 766919 and should point at rhnproxy.mozilla.org instead.
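The change described here amounts to a one-line config edit. A sketch (illustrative only — the config path and proxy hostname come from this comment and bug 766919, and the edit is shown against a local copy rather than the real file):

```shell
# Sketch: repoint up2date at the internal RHN proxy so RHEL hosts no
# longer need direct access to xmlrpc.rhn.redhat.com. Operates on a
# local copy instead of the real /etc/sysconfig/rhn/up2date.
conf=$(mktemp)
echo 'serverURL=https://xmlrpc.rhn.redhat.com/XMLRPC' > "$conf"
sed -i 's|xmlrpc.rhn.redhat.com|rhnproxy.mozilla.org|' "$conf"
cat "$conf"   # serverURL=https://rhnproxy.mozilla.org/XMLRPC
```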
Assignee: bhearsum → nobody
Adding bug 800866 as a dependency, per bug 800866 comments 3 to 6.
Depends on: 800866
Keywords: sheriffing-P1
Adding extra keywords to summary, since this is the second time I've struggled to find this bug :-)
Summary: [Tracking bug] prevent RelEng systems contacting the outside world → [Tracking bug] prevent RelEng systems contacting the outside world (block external network access)
Depends on: 840186
Depends on: 812342
Depends on: 882575
Depends on: 882610
No longer depends on: 882575
Depends on: 890832
Product: mozilla.org → Release Engineering
Bug 812342 has now reached its conclusion; however, we are still able to do things like:

(In reply to Nick Thomas [:nthomas] from bug 812342 comment #73)
> [cltbld@talos-r3-fed-040 ~]$ curl -si http://www.zombocom.com
> HTTP/1.1 302 Found
> Cache-Control: private
> Content-Type: text/html; charset=utf-8
> Location: http://www.hugedomains.com/domain_profile.cfm?d=zombocom&e=com
> Server: Microsoft-IIS/8.0
> X-Powered-By: ASP.NET
> Date: Sun, 02 Feb 2014 21:34:20 GMT
> Content-Length: 183
> 
> <html><head><title>Object moved</title></head><body>
> <h2>Object moved to <a
> href="http://www.hugedomains.com/domain_profile.cfm?d=zombocom&amp;
> e=com">here</a>.</h2>
> </body></html>
> 
> ie a test slave in scl1 being able to make a port 80 request on the
> internetz.


Due to:

(In reply to Michael Henry [:tinfoil] from bug 812342 comment #74)
> The *DEFAULT* firewall policy is Deny-All, but then there are exceptions
> allowed based on traffic usage.  Such exceptions include 80 and 443 to the
> internet.
> 
> Port 80 and 443 traffic to the internet was never filtered because it was
> still very actively being used at the time of analysis.  In addition the
> traffic analysis we had at the time was based strongly on IP, but not name
> requests (what the firewall sees).  Because of that in many cases we could
> not analyze what websites were being hit because most did not have accurate
> reverse dns.  Also since many webites being accessed were being hosted in
> sites like AWS, their IP addresses are subject to change without warning,
> making firewall policies impossible to maintain.
> 
> My suggestion was to move all 80 and 443 traffic from releng to a proxy so
> there could be analysis based on names and further restrictions can be
> applied.

I don't mind what implementation we go with, so long as non-whitelisted 80 and 443 connections are denied - but we need to do something more here, before we slowly introduce more hidden external dependencies.

Is there someone that would be up for driving this? :-)
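The whitelist-versus-deny triage being asked for here can be sketched in a few lines of shell. The whitelist contents below are illustrative, not the real firewall policy; the hostnames are taken from the flow dumps earlier in this bug:

```shell
# Sketch: classify a destination host as allow/deny against a
# whitelist, mirroring the manual triage done in earlier comments.
whitelist="rhnproxy.mozilla.org dp-pulse01.pub.phx1.mozilla.com"
classify() {
  # Pad with spaces so we match whole hostnames, not substrings.
  case " $whitelist " in
    *" $1 "*) echo allow ;;
    *)        echo deny  ;;
  esac
}
classify dp-pulse01.pub.phx1.mozilla.com   # -> allow
classify mirror.netcologne.de              # -> deny
```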
Chris, how far away are we from being able to deny all external access by default? I really think this bug will help prevent intermittent test failures caused by relying on outside content (e.g. the intermittents recently caused by speculative connections).
Flags: needinfo?(catlee)
Also bug 995041 now.
Depends on: 995041
Depends on: 995417
Depends on: 995599
Depends on: 995806
Depends on: 995995
Depends on: 996009
Depends on: 996019
Depends on: 996031
Blocks: 996504
Depends on: 992611
Depends on: 994302
Depends on: 996871
Depends on: 1004682
Flags: needinfo?(catlee)
Component: Other → General Automation
QA Contact: catlee
Depends on: 1026958
No longer depends on: 994302
Depends on: 1030093
Depends on: 1027113
Depends on: 1030111
Depends on: 1030149
Depends on: 1033472
Anyone know what's going on with this bug and its deps?
Flags: needinfo?(hwine)
Flags: needinfo?(catlee)
Flags: needinfo?(arich)
Flags: needinfo?(mpurzynski)
Most RelEng systems have access to the Internet, I guess? That is very unfortunate. Can we start working on the changes so they don't? That applies to both AWS and SCL3.
Flags: needinfo?(mpurzynski)
Yes, basically.  This bug predates my time at Mozilla!

As I understand it, the plan was to add a logging default permit to the 'untrust' zone, and then whittle away at what we see logged there -- either whitelisting or blocking -- until nothing's left.  Then changing it to a default deny.  That was tracked in bug 812342.

To be honest, I thought we were stalled out in the whittling stage, but now that I look at the flows on fw1.releng.scl3 and review bug 812342, it seems we have a default deny in place -- just with a huge exception for 80/443:

dustin@fw1.ops.releng.scl3.mozilla.net> show configuration groups global-policies security policies from-zone <*> to-zone untrust policy any--http-812342 
match {
    source-address any;
    destination-address any;
    application [ junos-http junos-https ];
}
then {
    permit;
}

There are three remaining blockers for this bug --  bug 882610, bug 1004682, bug 1026958 -- all of which probably indicate tests that will fail if we remove that policy.  Bug 812342 suggests there will be others as well.  The suggestion was to enable proxies, but QA has had a difficult time getting that working in the QA VLAN, since every OS and every application looks in a slightly different place for proxy configuration -- and I suspect Firefox is the worst offender!
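For reference, the proxy convention that most command-line tools honor is the `http_proxy`/`https_proxy` environment variables — and the fact that Firefox and other applications use their own settings instead is exactly the fragmentation described above. A sketch, with placeholder proxy host and port:

```shell
# Sketch: the env-var proxy convention honored by curl, wget, yum, etc.
# Firefox notably uses its own prefs instead. Host/port are placeholders,
# not a real RelEng proxy.
export http_proxy="http://proxy.releng.example:3128"
export https_proxy="$http_proxy"
export no_proxy="localhost,127.0.0.1,.mozilla.org"
echo "$https_proxy"
```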

And in the last few years, we've introduced a wrinkle: we've started sending some data directly to the public Internet from AWS.  In fact, it seems like this has become policy sometime in the last few months, although I haven't been able to get word from John Bircher or James Barnell on the topic aside from a reference in a bug.

Now there's the additional wrinkle that TaskCluster hosts don't, as far as I know, even run in a VPC that's connected to the Mozilla network, and seem to allow all outgoing traffic.

It would be really nice to decide where we want to go from here, but that's above my pay-grade.
Flags: needinfo?(hwine)
Flags: needinfo?(catlee)
Flags: needinfo?(arich)
We won't be doing any more work here in buildbot. I'm not sure if we'll be doing anything in particular in this space in taskcluster.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
Component: General Automation → General