Closed Bug 1026102 Opened 10 years ago Closed 9 years ago

Exhaustively, authoritatively document all releng BU flows

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1114] )

We need a way to document each and every flow that touches the releng BU or the releng AWS VPCs.

This should be exhaustive, and should be something we can, in principle, automatically verify against the configuration.  That means not prose, but some machine-parseable yet human-readable format.
Blocks: 1026112
Depends on: 1029599
Here's what I've got so far:

https://github.com/djmitche/bug1026102/blob/master/parse_juniper.py

I'd like to ultimately transform this into a data structure that can authoritatively answer "permit" or "deny" for any tuple (ip, ip, application).
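As a rough illustration of what such a data structure could look like (this is NOT the actual parse_juniper.py output -- the rule shape and all addresses below are made up), a first-match-wins rule list can answer "permit" or "deny" for any (ip, ip, application) tuple:

```python
import ipaddress

# Hypothetical rule shape, for illustration only.
# First matching rule wins; unmatched traffic is denied by default.
rules = [
    # (source network, destination network, application, action)
    (ipaddress.ip_network('10.26.48.0/22'),
     ipaddress.ip_network('10.22.70.0/24'), 'mysql', 'permit'),
    (ipaddress.ip_network('0.0.0.0/0'),
     ipaddress.ip_network('0.0.0.0/0'), 'mysql', 'deny'),
]

def permitted(rules, src_ip, dst_ip, app):
    """Answer permit (True) or deny (False) for an (ip, ip, application) tuple."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    for src_net, dst_net, rule_app, action in rules:
        if src in src_net and dst in dst_net and rule_app == app:
            return action == 'permit'
    return False  # implicit default deny
```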

Rather than trying to turn that into something human-readable, I have a better idea.  /cc ulfr because, in a way, this is the same idea as mig: write rules to verify the infrastructure.

In particular, how cool would it be to write a set of unit tests for flows?

def test_buildmaster_db():
    assert permit(buildmaster_ips, host('buildbot-rw-vip.db.scl3.mozilla.com'), 'mysql')
    assert permit(buildmaster_ips, host('buildbot-ro-vip.db.scl3.mozilla.com'), 'mysql')

The question is, how to define the negatives -- we don't want to try to write "assert deny(..)" for every possible denied flow!

Instead, I'm thinking we can organize that around applications:

def test_incoming_to_puppetmasters():
    # limit the universe of applications to consider
    assert incoming_applications(puppetmaster_ips) == set(['ssh', 'http', 'https', 'puppet', 'rsync'])
    assert sources_of(puppetmaster_ips, 'ssh') == releng_networks
    assert sources_of(puppetmaster_ips, 'http') == releng_networks + scl3_external_zeus_cluster
    assert sources_of(puppetmaster_ips, 'https') == releng_networks
    assert sources_of(puppetmaster_ips, 'puppet') == releng_networks
    assert sources_of(puppetmaster_ips, 'rsync') == scl3_external_zeus_cluster

Taken together, that completely characterizes the incoming flows to puppetmasters, such that any deviations -- either excessive permits or excessive denies -- will fail the test.
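For illustration only, `incoming_applications` and `sources_of` could be sketched over a flat list of permit rules like this (using the stdlib `ipaddress` module rather than fwunit's own types; the networks below are made up):

```python
import ipaddress

# Illustrative permit rules: (source network, destination network, application)
permits = [
    ('10.26.48.0/22', '10.26.48.45/32', 'ssh'),
    ('10.26.48.0/22', '10.26.48.45/32', 'puppet'),
    ('10.22.74.0/24', '10.26.48.45/32', 'rsync'),
]

def incoming_applications(permits, dst):
    """All applications that at least one rule permits to reach dst."""
    dst_ip = ipaddress.ip_address(dst)
    return {app for src, dnet, app in permits
            if dst_ip in ipaddress.ip_network(dnet)}

def sources_of(permits, dst, app):
    """Source networks permitted to reach dst on the given application."""
    dst_ip = ipaddress.ip_address(dst)
    return {src for src, dnet, a in permits
            if a == app and dst_ip in ipaddress.ip_network(dnet)}
```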

If we wrote a test to characterize incoming and outgoing flows from each host type, I think we'd be complete, but still pretty readable.

The overall idea here is that we publish the unit tests for releng's flows somewhere, and that changes to the flows come with a diff to the unit tests (along with the usual request).

Right now, running these tests involves running a few commands manually on the fw and some scp's.  I'm open to suggestions for an easier way to handle that.

This is intended as a status report on my work so far, but I'm excited to hear what everyone thinks of this idea.  I'm happy to get feedback here or by email.
OK, this is done and ready to start writing tests.  Please see

  https://github.com/djmitche/fwunit

The basic operation here is, you get three XML files from your firewall, you feed them to 'fwunit-prep', which digests them (takes ~3 minutes for fw1.releng.scl3).  Then you run your unit tests.

The unit tests aren't included in the repo, but here's what I was using to test:

from fwunit.ip import IP, IPSet
from fwunit.tests import Rules

fw1 = Rules('rules.pkl')

releng_net = IPSet([
    IP('10.26.48.0/22'),
    # TODO more..
    IP('10.132.0.0/16'),
    IP('10.134.0.0/16'),
])
other_internal = IPSet([IP('10.0.0.0/8')]) - releng_net

puppetmasters = IPSet([IP(ip) for ip in
    '10.26.48.45',      # releng-puppet1.srv.releng.scl3.mozilla.com
    '10.26.48.50',      # releng-puppet2.srv.releng.scl3.mozilla.com
    '10.132.48.212',    # releng-puppet1.srv.releng.usw2.mozilla.com
    '10.132.48.229',    # releng-puppet2.srv.releng.usw2.mozilla.com
    '10.134.48.57',     # releng-puppet1.srv.releng.use1.mozilla.com
    '10.134.49.5',      # releng-puppet2.srv.releng.use1.mozilla.com
])

def test_puppetmaster_access():
    for app in 'puppet', 'junos-http', 'junos-https':
        fw1.assertPermits(releng_net, puppetmasters, app)
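As an aside, the IPSet subtraction used for `other_internal` above can be approximated with the stdlib `ipaddress` module (fwunit ships its own IP/IPSet types; this just demonstrates the same set-difference idea):

```python
import ipaddress

# "Everything in 10/8 except releng" via address_exclude, which splits
# the supernet into the covering networks that remain after removal.
releng = ipaddress.ip_network('10.26.48.0/22')
other_internal = list(ipaddress.ip_network('10.0.0.0/8').address_exclude(releng))

def contains(networks, ip):
    """True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```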
Group: infra
So, I need help with three things:

 1. How can I script the process of grabbing the data from the firewall?  In particular, 'show route' and 'show security policies' aren't just config excerpts, so I don't think rancid is helpful.

 2. Do the tests need to be kept private?

 3. What will be the easiest way to use these tests to define and check policy changes?  Should they be checked in somewhere?  Run automatically, or on-demand?  By who?

Happy to hear thoughts here, in email, or to meet.  Please also feel free to cc others who might be more interested here.
On netops-dev1.private.scl3.mozilla.com there is a utility in my home
directory called "mm", short for "maintain many".

It gives you an example of how to run show commands on juniper devices
using the PyEZ netconf module.  

That might provide you with the "show" functionality you're looking for.
HTHs.
I am absolutely impressed by this. I would love to discuss it more over vidyo whenever you have one hour to spare.

>  2. Do the tests need to be kept private?
Yes. While our security does not depend on keeping configurations secret, there is no reason to make the job easier for attackers ;)
The mig-actions repository is also private for the same reason. And I'd love to open it up. But there are too many "what ifs".

> 3. What will be the easiest way to use these tests to define and check policy changes?
> Should they be checked in somewhere?  Run automatically, or on-demand?  By who?

I would recommend reusing what is done for MIG. At the bottom of MIG's API doc, there's a format description for compliance items: https://github.com/mozilla/mig/blob/master/doc/api.rst
MozDef can store these items, and they can be rendered in Kibana. Any change in state (compliance becomes false) can generate an alert of your choice.
(In reply to Dustin J. Mitchell [:dustin] from comment #3)
> So, I need help with three things:
> 
>  1. How can I script the process of grabbing the data from the firewall?  In
> particular, 'show route' and 'show security policies' aren't just config
> excerpts, so I don't think rancid is helpful.

netconf is the way to go: it outputs the configuration as XML and allows you to send commands to the firewall and receive structured answers. Good news - netconf is in fact SSH with a 'netconf' shell. Of course the commands are structured as well.
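For example, a netconf-style XML reply can be picked apart with the stdlib parser. The element names below are heavily simplified for illustration -- the real Junos reply uses its own schema and is nested much more deeply:

```python
import xml.etree.ElementTree as ET

# Simplified, made-up reply -- not the actual Junos XML schema.
reply = """
<security-policies>
  <policy>
    <policy-name>releng-to-db</policy-name>
    <source-address>releng-nets</source-address>
    <destination-address>db-vips</destination-address>
    <application>junos-mysql</application>
    <action>permit</action>
  </policy>
</security-policies>
"""

root = ET.fromstring(reply)
# Flatten each <policy> element into a tag -> text dict
policies = [
    {child.tag: child.text for child in policy}
    for policy in root.findall('policy')
]
```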

> 
>  2. Do the tests need to be kept private?

Yes, we don't want to publish our firewalls configuration.

> 
>  3. What will be the easiest way to use these tests to define and check
> policy changes?  Should they be checked in somewhere?  Run automatically, or
> on-demand?  By who?

NetOps has a configuration repository, so a git hook with 'run the check, email validation results' could be helpful.

https://mana.mozilla.org/wiki/display/NETOPS/Configuration+Archival

Basically network device sends configuration to the FTP server and a script run from cron commits it to the git repo.

> 
> Happy to hear thoughts here, in email, or to meet.  Please also feel free to
> cc others who might be more interested here.

0. Holy crap! You wrote parsing and compliance checking in 2 days!

Random ideas follow, numbered so they're easier to reference, but in no particular order.

1. a web interface that people can query before they go to NetOps asking to open a flow that's already open (but your host-based firewall blocks that)

2. parsing local netfilter data (mig can fetch it)

3. how do you deal with routing? If I ping from SCL3 to PHX1, the packets travel through the 'dc' zone outbound and then the 'dc' zone again in PHX1 inbound. The mechanism must know which zone will be used - it needs to read and parse the RIB.

4. unit tests for entire path - from host based firewall, through two firewalls (SCL3+PHX1) and host based on the destination
> 
> The question is, how to define the negatives -- we don't want to try to
> write "assert deny(..)" for every possible denied flow!
> 
> Instead, I'm thinking we can organize that around applications:
> 
> def test_incoming_to_puppetmasters():
>     # limit the universe of applications to consider
>     assert incoming_applications(puppetmaster_ips) == set(['ssh', 'http',
> 'https', 'puppet', 'rsync'])
>     assert sources_of(puppetmaster_ips, 'ssh') == releng_networks
>     assert sources_of(puppetmaster_ips, 'http') == releng_networks +
> scl3_external_zeus_cluster
>     assert sources_of(puppetmaster_ips, 'https') == releng_networks
>     assert sources_of(puppetmaster_ips, 'puppet') == releng_networks
>     assert sources_of(puppetmaster_ips, 'rsync') ==
> scl3_external_zeus_cluster
> 
> Taken together, that completely characterizes the incoming flows to
> puppetmasters, such that any deviations -- either excessive permits or
> excessive denies -- will fail the test.
> 

An idea for solving that: automatically run gradually more generic tests after the specific tests - without the user writing a line of code, like this:

say the last line is

    assert sources_of(puppetmaster_ips, 'rsync') == scl3_external_zeus_cluster

then you test some implicit rules, without writing them for each unit test:

    # should fail; if it passes, rsync is open more widely than documented
    assert sources_of(puppetmaster_ips, 'rsync') == our_10_slash_8_without_releng
    # should fail; if it passes, "any" is most likely opened
    assert sources_of(puppetmaster_ips, 'rsync') == IPSet([IP('8.8.8.8')])
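A slightly different formulation of the same idea (names and networks below are illustrative, not real rule-set output): rather than enumerating broader candidate sets by hand, a generic helper can assert that nothing beyond the documented sources is permitted, so an accidental "any" rule fails loudly without anyone writing a per-test negative:

```python
def implicit_source_check(actual_sources, documented_sources):
    """Generic check, run after every specific test: fail if the rule set
    permits any source not in the documented set."""
    extras = actual_sources - documented_sources
    assert not extras, "unexpected sources permitted: %s" % extras

# Illustrative values: the firewall permits exactly what we documented.
documented = {'10.22.74.0/24'}
implicit_source_check({'10.22.74.0/24'}, documented)  # passes: exact match
```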
Depends on: 1044034
Depends on: 1044035
Depends on: 1044042
OK, so I added some sub-bugs for various bits.  Overall, the plan is:

* Run fwunit-prep on some private host that has access to all firewalls to generate rule sets.
* Run a set of unit tests against the rule sets periodically and on-demand
  * The unit tests are in a git repository that's more widely accessible than the fw configs
* Run a set of compliance checks against the rule sets periodically, and report to MozDef

To the points raised above:

PyEZ looks great, but requires an SSH password.  I assume netconf is the same (link?).  Is there a role account with r/o access that I could use, or should I set one up? 

It looks like the Configuration Archival stuff only includes configuration; this particular project requires the set of in-place policies (show security policies, not show configuration security policies), as those have all of the group inheritance already worked out into a single sequence.

As for a web interface, that would add another layer of abstraction on top of this one (either trying to represent the tests graphically, or trying to represent the flows graphically -- both hard).  It would also be trickier to secure against unauthorized access, and would lose the benefits of tracking changes, looking at history, etc.  So, I think the user interface for this data is 'git', which is reasonably familiar to most folks.

As for netfilter data -- I don't want to get to the level of host firewalls, or of measuring actual flows of actual bits.  The main rationale for this project is the parent bug, simplifying the process of changing flow configuration.

And finally, routing along the entire path -- this is a great question.  My plan is to assume the current standard continues to hold: traffic is denied on the last firewall along the flow.  So if you want to write tests for access to a buildbot master in scl3, you write those tests against the rules for fw1.releng.scl3, but if you want to write tests about access to a DB server in scl3, you write those tests against the rules for fw1.scl3.  The same could be true of phx1, if releng had anything there.


So the remaining questions are:
 * Where should I host this?  Is netops2 appropriate?
 * Is this actually going to help? (I need to talk to releng and netops about this); and
 * How, exactly, should we run this to make it most useful? (again, need to talk to releng and netops)
> So the remaining questions are:
>  * Where should I host this?  Is netops2 appropriate?

tufin1.private.scl3.mozilla.com is the best place to run it - it already has login access to all of our network gear.

A read only account also exists with the name tufin, you can reuse it or copy and create your own.
Totally loving this bug. Brilliant idea Dustin!
Notes from my meeting with dividehex, ben, Callek, jlund, and pmoore:

* It'd be useful to have a releng-only nagios alert for when tests fail, especially if the process involves landing the (failing) test before the flow is put in.  The visibility of the test going green is then a good indication of completion.
* Documentation of the process should be clear and well-distributed
* We need a way to make the dumped pkl files available to relengers for writing new tests.  I don't want to check them in, because they'll make the git repo huge, so maybe this is just an SSH download from tufin1 or something like that.
Depends on: 1046160
I need to re-focus this: I've gotten distracted with the idea of running tests to verify configurations, and that's not the point.  The point is to have flows that are documented and auditable -- and it turns out that unit tests are a good way to accomplish that.

Running the tests periodically and alerting when they fail is a good way to keep us honest -- but that can be daily, at some time when netops is unlikely to be in the midst of a change.

Asking anyone in releng who requests a new flow to write unit tests, and asking everyone in netops to be ready to understand a diff to the tests, is unreasonable.  So I'd like to route all flow requests through me (which is very nearly the case already), and once I've got the process practiced, I'll train some additional relengers so that there's some redundancy.
Group: infra
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/387]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/387] → [kanban:engops:https://kanbanize.com/ctrl_board/6/390]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/390] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1102] [kanban:engops:https://kanbanize.com/ctrl_board/6/390]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1102] [kanban:engops:https://kanbanize.com/ctrl_board/6/390] [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1112] [kanban:engops:https://kanbanize.com/ctrl_board/6/390] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1112] [kanban:engops:https://kanbanize.com/ctrl_board/6/390] [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1113] [kanban:engops:https://kanbanize.com/ctrl_board/6/390]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1113] [kanban:engops:https://kanbanize.com/ctrl_board/6/390] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1114] [kanban:engops:https://kanbanize.com/ctrl_board/6/390]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1114] [kanban:engops:https://kanbanize.com/ctrl_board/6/390] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1114]
Depends on: 1110883
The remaining bit here is to output these flow tests to a webpage of some sort, and allow relengers to access that page.
Depends on: 1118445
yay, tracker complete
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED