Closed Bug 1495917 Opened 6 years ago Closed 5 years ago

Clean up of scl3 from securitygroups

Categories

(Release Engineering :: General, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: apop, Assigned: bcrisan)

References

Details

Attachments

(1 file, 2 obsolete files)

After adding the scriptworkers into mdc1, we received a lot of broken checks from nagios-releng-mdc1.
Then, we have found some scl3 entries under securitygroups.yml
Blocks: 1484880
while checking the securitygroups.yml, I have found many entries related scl3 and some related to aws
Entries like this :

# all slave VLANs look the same
    slave-vlan-inbound:
        - proto: tcp
          ports: [22]
          hosts:
            - {include: slaveapi-servers}
            - aws-manager1.srv.releng.scl3.mozilla.com
            - aws-manager2.srv.releng.scl3.mozilla.com

aws-manager-ssh:
      proto: tcp
      ports: [22]
      hosts:
        - aws-manager1.srv.releng.scl3.mozilla.com
        - aws-manager2.srv.releng.scl3.mozilla.com

Are these still needed ?
Flags: needinfo?(nthomas)
The answer isn't 100% certain. We have certainly no longer need slave-vlan-inbound, and aws-manager1 and aws-manager2 from scl3 are gone. 

aws-manager-ssh seems to be used so that puppet can be run after the initial instance creation. We may need to create a new aws-manager style of instance to create new instances, mostly scriptworker. jlorenzo was going to try running the code locally to see if that works OK.
Flags: needinfo?(nthomas) → needinfo?(jlorenzo)
Maybe a better question should be : What lines we still need and which ones we don't ?
Flags: needinfo?(nthomas)
Flags: needinfo?(jwatkins)
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #2)
> jlorenzo was going to try running the code locally to see if that works OK.

I found a workaround which doesn't require me to spin a machine up. So, I haven't looked closely at running the script locally. However, Mihai got some instructions from Aki which bypasses aws-manager, too. Did I get this correctly, Mihai? If so, I'm in favor of removing these entries.
Flags: needinfo?(jlorenzo) → needinfo?(mtabara)
(In reply to Johan Lorenzo [:jlorenzo] from comment #4)
> (In reply to Nick Thomas [:nthomas] (UTC+13) from comment #2)
> > jlorenzo was going to try running the code locally to see if that works OK.
> 
> I found a workaround which doesn't require me to spin a machine up. So, I
> haven't looked closely at running the script locally. However, Mihai got
> some instructions from Aki which bypasses aws-manager, too. Did I get this
> correctly, Mihai? If so, I'm in favor of removing these entries.

Yes, Aki gave me some instructions which I want to test. If that works, we're safe to forget about the aws-manager as we'd have some backup solutions to ramp-up scriptworkers. However, I don't have headspace to test those until the end of this week or beginning of next week.

If this bug is not urgent, maybe you can block it on me until I fully test those instructions? 
If this is needed sooner, I think we can maybe find someone else to give those a try?
Thanks for the context Mihai!

aws-manager is already decommissioned. So is slc3. I don't think we should keep the vlan/ssh config around. Worst case scenario, if comment 5 doesn't work, we can still look back at the git history to craft something similar. 

Please remove the configuration related to aws-manager, Adrian. Please let me know if you need more information.
Flags: needinfo?(nthomas)
Flags: needinfo?(mtabara)
Flags: needinfo?(jwatkins)
Flags: needinfo?(apop)
(In reply to Johan Lorenzo [:jlorenzo] from comment #6)
> Thanks for the context Mihai!
> 
> aws-manager is already decommissioned. So is slc3. I don't think we should
> keep the vlan/ssh config around. Worst case scenario, if comment 5 doesn't
> work, we can still look back at the git history to craft something similar. 
> 
> Please remove the configuration related to aws-manager, Adrian. Please let
> me know if you need more information.

+1
(In reply to Johan Lorenzo [:jlorenzo] from comment #6)

> Please remove the configuration related to aws-manager, Adrian. Please let
> me know if you need more information.

Okay, reading through the bug, it seems like we were always confident removing: slave-vlan-inbound, and aws-manager1 and aws-manager2

and Johan/Mihai confirmed that we can also remove aws-manager-ssh which addresses Nick's concerns.
Assignee: nobody → apop
Flags: needinfo?(apop)
Please check and review
Please check the patch that I've attached and review it.

Thank you
Flags: needinfo?(mtabara)
Flags: needinfo?(jlorenzo)
(In reply to Adrian Pop from comment #11)
> Please check the patch that I've attached and review it.
> 
> Thank you

Sure, will do. 

Btw, you can set r? directly on the patch by going to "Details" and asking "r?" + ":mtabara: or ":jlorenzo" so that the review request links to the patch. In this bug it's quite easy to spot it, but in a bug with more patches, it'd be confusing a bit.
Comment on attachment 9016913 [details] [diff] [review]
patch for removing lines related to aws hosts with scl3 as data center

Review of attachment 9016913 [details] [diff] [review]:
-----------------------------------------------------------------

::: configs/securitygroups.yml
@@ +328,4 @@
>            ports:
>              - {include: buildbot-http-portrange}
>            hosts:
> +            - {include: slaveapi-servers}            

Tiny nit: are there some trailing spaces here?
Attachment #9016913 - Flags: review+
Flags: needinfo?(mtabara)
Comment on attachment 9016913 [details] [diff] [review]
patch for removing lines related to aws hosts with scl3 as data center

LGTM. Same comment as Mihai.
Flags: needinfo?(jlorenzo)
Attachment #9016913 - Flags: review+
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #13)
> Comment on attachment 9016913 [details] [diff] [review]
> patch for removing lines related to aws hosts with scl3 as data center
> 
> Review of attachment 9016913 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: configs/securitygroups.yml
> @@ +328,4 @@
> >            ports:
> >              - {include: buildbot-http-portrange}
> >            hosts:
> > +            - {include: slaveapi-servers}            
> 
> Tiny nit: are there some trailing spaces here?

I've checked and I couldn't see them. 

About the other entries related to scl3 and scl3 host. I'll create a different patch and add it here for the review
Attached patch bug_1495917_2.patch (obsolete) — Splinter Review
removed the other hosts that had scl3 in their names and a few entries that had scl3.
Attachment #9019222 - Flags: review?(nthomas)
Attachment #9019222 - Flags: review?(jwatkins)
Please check and review my last patch. Thank you
Comment on attachment 9019222 [details] [diff] [review]
bug_1495917_2.patch

Review of attachment 9019222 [details] [diff] [review]:
-----------------------------------------------------------------

It would be much easier to do these reviews in a github pull request.  If you submit a PR, you can attach it to the bug by simply pasting the PR URL to the attachment box.  A github pull request would allow myself, and other reviewers, to inspect the diff as an entire file.  That would give a much better chance of reviews catching errors.

::: configs/securitygroups.yml
@@ +31,1 @@
>            - nagios1.private.releng.mdc1.mozilla.com

The mdc2 nagios host should also be listed here

@@ +31,3 @@
>            - nagios1.private.releng.mdc1.mozilla.com
>  
>      # all slave VLANs look the same

If you are removing the entire slave-vlan-inbound group, you should also remove its comment.

@@ -179,1 @@
>      slaveapi-servers: 10.26.48.16/31

slaveapi died with scl3 and should also be removed.  In fact, anywhere you see the cidr 10.26.x.x/x, it should be removed since that was scl3s network range.

@@ +279,4 @@
>          - proto: tcp
>            ports: [22, 3389, 5900]
>            hosts:
> +

This becomes invalid without at least one host listed.  Find out the mdc1 and mdc2 equivalents and add them.

@@ +348,1 @@
>              - nagios1.private.releng.mdc1.mozilla.com

The MDC2 nagios host should also be listed

@@ +379,1 @@
>              - nagios1.private.releng.mdc1.mozilla.com

Missing mdc2 nagios host

@@ +402,1 @@
>              - nagios1.private.releng.mdc1.mozilla.com

missing mdc2 nagios host

@@ +421,4 @@
>              - 10.26.0.0/16
>              - 10.130.0.0/16
>              - 10.132.0.0/16
> +            - 10.134.0.0/16            

Trailing whitespace and missing mdc2 nagios host

@@ +425,2 @@
>              - nagios1.private.releng.mdc1.mozilla.com
>          - include: aws-manager-ssh

aws-manager-ssh was removed in this patch, it should be removed everywhere else it is referenced.
Attachment #9019222 - Flags: review?(jwatkins) → review-
Attachment #9019222 - Flags: review?(nthomas)
With suggestions provided above, I made this commit (in my fork) https://github.com/bccrisan/build-cloud-tools/commit/faee3f8b9bf5ba1b310e2d477792f7c34ca5d5c5

There are few definitions that we may not need anymore and I'm not sure about them.
Do we still need this:
- outbound-ssh-upload-try
- test-slave-vlan-outbound
- try-slave-vlan-outbound
- build-slave-vlan-outbound
- slave-vlan-outbound
- loaners
- buildbot-master
Flags: needinfo?(nthomas)
Flags: needinfo?(jwatkins)
(In reply to Bogdan Crisan [:bcrisan] (GMT+2) from comment #19)
> With suggestions provided above, I made this commit (in my fork)
> https://github.com/bccrisan/build-cloud-tools/commit/
> faee3f8b9bf5ba1b310e2d477792f7c34ca5d5c5
> 
> There are few definitions that we may not need anymore and I'm not sure
> about them.
> Do we still need this:
> - outbound-ssh-upload-try
> - test-slave-vlan-outbound
> - try-slave-vlan-outbound
> - build-slave-vlan-outbound
> - slave-vlan-outbound
> - loaners
> - buildbot-master
These also:
  - buildbot-http-portrange: 8000-8999
  - buildbot-rpc-portrange: 9000-9999
(In reply to Bogdan Crisan [:bcrisan] (GMT+2) from comment #19)
> With suggestions provided above, I made this commit (in my fork)
> https://github.com/bccrisan/build-cloud-tools/commit/
> faee3f8b9bf5ba1b310e2d477792f7c34ca5d5c5
> 
> There are few definitions that we may not need anymore and I'm not sure
> about them.
> Do we still need this:
> - outbound-ssh-upload-try
> - test-slave-vlan-outbound
> - try-slave-vlan-outbound
> - build-slave-vlan-outbound
> - slave-vlan-outbound
> - loaners
> - buildbot-master

Honestly, I'm not sure what is still needed either.  Maybe someone in Releng would have a better sense.  Since buildbot is dead it seems like we could rip out a bunch of things.  Make sure if things are ripped out, there are no other ec2 instances inadvertently using the security groups being removed.  And make sure you remove any references to the removed definitions.
Flags: needinfo?(jwatkins)
(In reply to Jake Watkins [:dividehex] from comment #21)
> (In reply to Bogdan Crisan [:bcrisan] (GMT+2) from comment #19)
> > With suggestions provided above, I made this commit (in my fork)
> > https://github.com/bccrisan/build-cloud-tools/commit/
> > faee3f8b9bf5ba1b310e2d477792f7c34ca5d5c5
> > 
> > There are few definitions that we may not need anymore and I'm not sure
> > about them.
> > Do we still need this:
> > - outbound-ssh-upload-try
> > - test-slave-vlan-outbound
> > - try-slave-vlan-outbound
> > - build-slave-vlan-outbound
> > - slave-vlan-outbound
> > - loaners
> > - buildbot-master
> 
> Honestly, I'm not sure what is still needed either.  Maybe someone in Releng
> would have a better sense.  Since buildbot is dead it seems like we could
> rip out a bunch of things.  Make sure if things are ripped out, there are no
> other ec2 instances inadvertently using the security groups being removed. 
> And make sure you remove any references to the removed definitions.

the *slave* and *master* ones I would imagine can be removed however, I'd be concerned that other things use these. Not sure about outbound-ssh-upload-try. Nick, do you know?
Mostly we can remove all of that, eg outbound-ssh-upload-build and outbound-ssh-upload-try are no longer needed.

We will need to keep buildbot-master because of bm01 and bm77 running l10n-bumper and bouncer checks, but it can be shortened to 

buildbot-master:
    description: security group for buildbot masters
    regions:
        us-west-1: vpc-7a7dd613
        us-west-2: vpc-cd63f2a4
        us-east-1: vpc-b42100df
    inbound:
        # generic stuff
        - include: jumphost-admin-access
        - include: nagios-nrpe
        - include: global-ping
    outbound:
        # TODO: bug 1210137
        - include: global-any

Other things you could remove: proxxy, any include which isn't used by anything
Flags: needinfo?(nthomas)
So far I removed:
 - outbound-ssh-upload-build, 
 - outbound-ssh-upload-try, 
 - test-slave-vlan-outbound, 
 - try-slave-vlan-outbound, 
 - build-slave-vlan-outbound, 
 - slave-vlan-outbound, 
 - tests, 
 - build, 
 - try, 
 - loaners

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #23)
> Mostly we can remove all of that, eg outbound-ssh-upload-build and
> outbound-ssh-upload-try are no longer needed.
That has been removed in https://github.com/bccrisan/build-cloud-tools/commit/24ca09cfac57b6a88ad510c2f2357cddad7d9fe8
 
> We will need to keep buildbot-master because of bm01 and bm77 running
> l10n-bumper and bouncer checks, but it can be shortened to 
I was expecting that
 
> buildbot-master:
>     description: security group for buildbot masters
>     regions:
>         us-west-1: vpc-7a7dd613
>         us-west-2: vpc-cd63f2a4
>         us-east-1: vpc-b42100df
>     inbound:
>         # generic stuff
>         - include: jumphost-admin-access
>         - include: nagios-nrpe
>         - include: global-ping
>     outbound:
>         # TODO: bug 1210137
>         - include: global-any
Replaced with ^^ in commit https://github.com/bccrisan/build-cloud-tools/commit/24ca09cfac57b6a88ad510c2f2357cddad7d9fe8


> Other things you could remove: proxxy, any include which isn't used by
> anything
I already removed a bunch of scl3 related proxxy's in https://github.com/bccrisan/build-cloud-tools/commit/faee3f8b9bf5ba1b310e2d477792f7c34ca5d5c5#diff-57d617afa8ef2d0089e9ce90bae279f4L177

I'm not entirely sure what others could be removed and are unused.

Also, dividehex pointed out a few things (slave-vlan-inbound being removed but still referenced in several places) just realized that those places we may not need them (was referenced in tests, build, and try), those got removed and the vpn network and ssh jumphost should be replaced with mdc1/2 equivalents, which again, not necessary since "tests" were removed.

bcrisan: is there more clean up here required? Sounds like in comment 24 we resolved most known scl3 references. Should we resolve and follow up in subsequent bug if others find more old references?

Flags: needinfo?(bcrisan)

I think I covered all of the changes in comment 24, should I do a PR from my fork? (I'm not sure if it's safe since I did not updated it since the last time I worked on it)
Or create a patch file and attach it here?

Assignee: apop → bcrisan
Flags: needinfo?(bcrisan) → needinfo?(jlund)

you can create a PR and link the PR URL in a Bugzilla attachment here.

Flags: needinfo?(jlund)
Attached file GitHub Pull Request

Created PR #373

Attachment #9016913 - Attachment is obsolete: true
Attachment #9019222 - Attachment is obsolete: true

Merged and no conflict appeared meanwhile.
This can be marked as fixed. Please reopen if something else needs to be done.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: