Closed Bug 1189520 Opened 10 years ago Closed 9 years ago

decom the IT sentry/errormill instance

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fox2mike, Assigned: ericz)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1488] )

Sentry/Errormill is in phx1 and we need to decide if it goes to scl3 or dies in phx1. This is a tracker.
During the great spring cleaning event Errormill was without an owner. At the time we were rolling out newrelic, which can do error collection, and new relic was replacing Errormill on a number of sites. My projects are low traffic and rely on daily cron tasks, so newrelic's error sampling could result in missing critical problems. A few other projects had their own reasons for wanting to keep it around. I volunteered to take over maintenance and was given root on the boxes. I've kept it up to date with security upgrades only, discouraged new users, and have been unwilling to accept the risk of a major version upgrade. The plans to migrate things out of PHX and into AWS involved cloudops (oremj) standing up a new sentry instance for sites that moved. This would have allowed us to kill errormill without migration. That no longer looks like an option, since our current plan (as I understand it) is to migrate mostly PHX -> SCL instead. I suspect we will need to keep Errormill running for now.
bmo uses errormill.
> During the great spring cleaning event Errormill was without an owner. At
> the time we were rolling out newrelic, which can do error collection, and
> new relic was replacing Errormill on a number of sites.
unfortunately new relic's support of perl applications is lackluster to non-existent for production systems.
> I suspect we will need to keep Errormill running for now.
+1
shyam, do you know when a decision will be made on this? i'd like as much time as possible to source alternatives if errormill is going away.
Flags: needinfo?(smani)
Byron, we met for a bit today and it'll be supported for a while. You should look at alternatives, but we're planning on applying security fixes and keeping it going for now.
Flags: needinfo?(smani)
We are running out of time to upgrade and instead we will shut it off and recommend moving if you still want sentry. Cloud Ops will host anyone who wants to keep using a sentry instance. I announced to the webdev extravaganza last month that this was coming but I didn't have details worked out yet. I'm sending an e-mail now to everyone who has logged into errormill in the last year since there is no better list. It contains steps to migrate to the new sentry instance.
Assignee: server-ops-webops → chris.lonnen
Summary: Decide future of sentry/errormill → decom the IT sentry/errormill instance
MDN is now officially off of errormill.
We confirmed out of bug that BMO is no longer on errormill, so we're going to proceed with shutting this off now.
Assignee: chris.lonnen → server-ops-webops
Assignee: server-ops-webops → nmaul
sentry[12].webapp.phx1.mozilla.com powered off
sentry[12].db.phx.mozilla.com powered off
ZLB backend DB VIPs (ro/rw) in PHX1 disabled
Nagios alerts for above removed (thanks :sal)
RabbitMQ connections, queue, vhost, and user removed
ZLB frontend TIG, Vserver, pool removed
LDAP Group:
modules/ldap_users/manifests/groups/vpn/vpn_errormill.pp
Can be removed now.
NI: @limed ... can you take care of this? Or feel free to pass off to any LDAP admin :)
Flags: needinfo?(limed)
A gaggle of things removed from puppet and inventory. Some remainders for Monday:
hiera/secrets/site.yaml:443:secrets_crashstats_stage_errormill_apikey
modules/socorro/*
modules/webapp/templates/socorro-stage/etc-socorro/local.py.erb
Those all reference errormill in their configs and thus probably haven't worked since the VIP was initially shut off by atoll a month ago.
modules/socorro/manifests/access.pp IS IN USE, but AFAICT the rest of the socorro module is not and can be safely removed. I recommend moving the stuff in access.pp into the lone node definition (it's just user/sudo stuff, nothing socorro-specific at all).
Once all those are cleaned up, I believe we'll be done here.
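For anyone repeating this kind of cleanup, the hunt for leftover references can be sketched roughly as below. This is a minimal illustration, not the tooling actually used in this bug: the function names `scan_lines` and `find_references` are hypothetical, and the default search term `errormill` is simply the string discussed above. Results come back in the same `path:line:text` shape as the remainders listed in the comment.

```python
import os
from typing import Iterable, List, Tuple

# (file path, 1-based line number, matching line without trailing newline)
Match = Tuple[str, int, str]


def scan_lines(path: str, lines: Iterable[str], needle: str) -> List[Match]:
    """Return every line that mentions `needle`, case-insensitively."""
    needle = needle.lower()
    return [
        (path, lineno, line.rstrip("\n"))
        for lineno, line in enumerate(lines, start=1)
        if needle in line.lower()
    ]


def find_references(root: str, needle: str = "errormill") -> List[Match]:
    """Walk a checkout (hypothetically, a puppet tree) and collect matches."""
    matches: List[Match] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip VCS metadata and keep traversal order deterministic.
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps binary-ish files from raising.
                with open(path, encoding="utf-8", errors="replace") as fh:
                    matches.extend(scan_lines(path, fh, needle))
            except OSError:
                # Unreadable file (permissions, dangling symlink): skip it.
                continue
    return matches
```

Calling `find_references("/path/to/puppet/checkout")` (path is a placeholder) and printing each tuple as `f"{path}:{lineno}:{text}"` gives a list that can be pasted into a bug comment. A plain `grep -rni errormill` does the same job; the Python form is only useful if the results need further filtering.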
Note that there used to be a severe bug in Puppet where doing user/sudo stuff in a node definition *rather than* in modules/* could cause one user's SSH keys to overwrite another user's SSH keys, so make sure Puppet runs cleanly on the host after moving user/sudo stuff.
(In reply to Jake Maul [:jakem] from comment #11)
> LDAP Group:
> modules/ldap_users/manifests/groups/vpn/vpn_errormill.pp
> Can be removed now.
> NI: @limed ... can you take care of this? Or feel free to pass off to any
> LDAP admin :)
For this, file a bug under infra::ldap to have the vpn group removed in ldap, and it will kill the group automatically for us.
Flags: needinfo?(limed)
Looks like Nagios is still configured with the Zeus VIPs:
    'sentry-ro-vip.db.scl3.mozilla.com' => {
        parents => 'zlb1.ops.scl3.mozilla.com',
        hostgroups => [ 'zeus-vips' ]
    },
    'sentry-rw-vip.db.scl3.mozilla.com' => {
        parents => 'zlb1.ops.scl3.mozilla.com',
        hostgroups => [ 'zeus-vips' ]
    },
Also, please check Zeus. AND make sure it's on the decom checklist that any dbs need to have their Zeus VIP and pools removed. This often gets missed.
Assignee: nmaul → server-ops-webops
I just removed sentry[12].db.phx1.mozilla.com from puppetdashboard (not sure why they're still there, the boxes were decom'd, but they were flagging as having an older version of MySQL, so...I removed them). You may want to check https://puppetdashboard.mozilla.org/hosts?utf8=%E2%9C%93&search=sentry (and perhaps search errormill too) for what hasn't been taken out of puppetdashboard. Also...what other decom steps were missed/skipped? (ugh, sorry)
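One way to make "what was missed?" easier to answer is to track the decom steps explicitly. The sketch below is only an illustration: the step names are collected from the comments in this bug, not taken from any official decom checklist, and `DECOM_STEPS` and `DecomChecklist` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumption: step names paraphrased from this bug's comments,
# not copied from a real decom checklist.
DECOM_STEPS: List[str] = [
    "hosts powered off",
    "Zeus (ZLB) VIPs and pools removed",
    "Nagios checks removed",
    "RabbitMQ connections, queue, vhost, and user removed",
    "LDAP group removed",
    "Puppet classes and secrets removed",
    "puppetdashboard entries removed",
    "inventory entries removed",
    "DNS entries removed",
]


@dataclass
class DecomChecklist:
    """Tracks which decom steps are finished for one service."""

    service: str
    done: Dict[str, bool] = field(
        default_factory=lambda: {step: False for step in DECOM_STEPS}
    )

    def mark_done(self, step: str) -> None:
        # Reject unknown step names so a typo cannot silently pass.
        if step not in self.done:
            raise KeyError(f"unknown decom step: {step!r}")
        self.done[step] = True

    def remaining(self) -> List[str]:
        """Steps still open, in checklist order."""
        return [step for step in DECOM_STEPS if not self.done[step]]
```

For example, creating `DecomChecklist("errormill")`, marking "hosts powered off" as done, and calling `remaining()` returns the other eight steps, which is roughly the state this bug was in when the Zeus VIPs and puppetdashboard entries were found still in place.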
Assignee: server-ops-webops → smani
Assignee: smani → server-ops-webops
Assignee: server-ops-webops → smani
Assignee: smani → server-ops-webops
Assignee: server-ops-webops → smani
Assignee: smani → server-ops-webops
Assignee: server-ops-webops → eziegenhorn
All sentry zeus VIPs have been removed. Sentry puppet classes have been removed. Nagios checks removed. Lots of old DNS entries (sjc1 anyone?) have been cleaned up. There are 5 remaining boxes in inventory, 4 dead and 1 alive that I've made separate bugs for the MOC to kill and properly clean up. Otherwise, I believe this is done.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard