[CAB] Migrate Pulse to the cloud

RESOLVED WONTFIX

Status

Infrastructure & Operations
Change Requests

People

(Reporter: mcote, Unassigned)

Description

3 years ago
We would like to migrate Pulse (RabbitMQ cluster) and PulseGuardian (Pulse management web app) over to CloudAMQP and Heroku, respectively.  See bug 1205867 for more.

Date, time, duration of maintenance:

Next TCW, ideally 2015/10/17.  Time would ideally be after 2:00 EDT.

System(s) affected:

Mainly pulse.mozilla.org (RabbitMQ cluster and PulseGuardian web app), but all systems that write to or read from Pulse as well (Buildbot, Taskcluster, Bugzilla, and a few others).

End-user impact:

Publishers may be unable to write messages for the duration (an hour or so).  Subscribers will be disconnected and will have to reconnect, and the new DNS entry for pulse.mozilla.org will have to propagate out to all systems.
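
For owners of affected systems, one way to tell whether the new DNS entry has reached a given host is to resolve the name locally and compare the result with the new cluster's addresses. The following is only a rough sketch: the function names are made up for illustration, and the expected addresses are an assumption that would have to come from whoever provisions the new cluster.

```python
import socket


def resolved_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the set of IP addresses the local resolver gives for hostname."""
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # Resolution failed; treat it as "no addresses known yet".
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); the address
    # string is the first element of sockaddr for both IPv4 and IPv6.
    return {entry[4][0] for entry in results}


def dns_has_propagated(hostname, expected, resolver=socket.getaddrinfo):
    """True once every address seen locally belongs to the expected new set."""
    current = resolved_addresses(hostname, resolver)
    return bool(current) and current <= set(expected)
```

A call such as dns_has_propagated("pulse.mozilla.org", {"203.0.113.10"}) would return True only once this host sees nothing but the new addresses; the address shown is a documentation placeholder, not the real one.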

Maintenance plan and timeline (link to a wiki or etherpad is fine):

https://docs.google.com/document/d/1F207nMJUXXxyDNuJuoPDfFzK39RSy0gOqrYMR-21AcQ/edit# (step 7)

Rollback plan / rollback point (at which point will you determine to roll back):

If we cannot verify that systems can write to and read from the new cluster (steps 7g-i), we will have to reset DNS and open the ports on pulse.mozilla.org back up.  There should be no fallout from this aside from potentially some lost messages, which is expected in any case.
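
As a rough illustration of the kind of write/read check that would drive the rollback decision (not the actual procedure in steps 7g-i of the plan), the sketch below publishes a uniquely tagged message and waits for it to come back. The publish and consume callables are hypothetical; they stand in for whatever AMQP client each system already uses.

```python
import time
import uuid


def verify_round_trip(publish, consume, timeout=30.0, poll_interval=0.5):
    """Publish a uniquely tagged message and wait until it is read back.

    publish(body) and consume() are hypothetical callables supplied by the
    caller; consume() returns the next message body, or None if none is ready.
    """
    marker = "cutover-check-" + uuid.uuid4().hex
    publish(marker)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        body = consume()
        if body == marker:
            return True
        if body is None:
            # Nothing waiting yet; pause briefly before polling again.
            time.sleep(poll_interval)
    # Timed out without seeing our own message: a candidate for rollback.
    return False
```

A False result within the agreed timeout would be the signal to reset DNS and reopen the ports on the old cluster.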

Notification mechanisms:

Not entirely sure what this means, but I'll be on vidyo and IRC.

Who will be point, who else will be involved:

mcote, with support from jgriffin
Flags: cab-review?
(In reply to Mark Côté [:mcote] from comment #0)
> 
> Next TCW, ideally 2015/10/17.  Time would ideally be after 2:00 EDT.

Next TCW is 2015-10-10. The Google doc has events scheduled for the preceding Friday; can everything be moved to 2015-10-09 and 2015-10-10?

> Notification mechanisms:
> 
> Not entirely sure what this means, but I'll be on vidyo and IRC.

It means which user groups need advance notice. You mentioned the Releng and Taskcluster teams. Anyone else?

Also, not critical to the TCW, but needed after the cutover is successful:
 - update netflows for new IP address(es) serving pulse.m.o
 - update netflows to remove old IP addresses serving pulse.zlb.phx
Could you add that to your checklist, please?
Flags: needinfo?(mcote)
(Reporter)

Comment 2

3 years ago
(In reply to Hal Wine [:hwine] (use NI) from comment #1)
> (In reply to Mark Côté [:mcote] from comment #0)
> > 
> > Next TCW, ideally 2015/10/17.  Time would ideally be after 2:00 EDT.
> 
> Next TCW is 2015-10-10. The Google doc has events scheduled for the
> preceding Friday; can everything be moved to 2015-10-09 and 2015-10-10?

As mentioned in the TCW, that's Canadian Thanksgiving weekend, so not a great time for me (or any other Canadians who might be involved in this TCW).  However, it's possible that we can do this migration the day before.  It'll mean some lost messages if Buildbot is still chugging along, but that might be fine with people, given that this *should* be the last planned outage, for some time at least.  I'll check with a few people and get back to you shortly.

> > Notification mechanisms:
> > 
> > Not entirely sure what this means, but I'll be on vidyo and IRC.
> 
> It means which user groups need advance notice. You mentioned the Releng
> and Taskcluster teams. Anyone else?

A-Team, plus a few individuals who run Pulse-related services, such as glandium (I'll get back to you with a better list).

> Also, not critical to the TCW, but needed after the cutover is successful:
>  - update netflows for new IP address(es) serving pulse.m.o

Sorry, I'm not clear on this one. Presumably we'd want to add any necessary netflows for talking to Pulse's new location *before* the window to ensure a smooth cutover, right?  I have filed one for Buildbot, for example.
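
For what it's worth, a flow like that can be sanity-checked ahead of the window by attempting a plain TCP connection from the client host to the new endpoint. This is just a sketch; the function name is made up, and the host and port to test are assumptions that depend on how the new cluster is exposed.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

Something like can_connect(new_host, 5671) would exercise the standard AMQP-over-TLS port, assuming that is the port the new cluster listens on; new_host is a placeholder for the new endpoint's hostname.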

>  - update netflows to remove old IP addresses serving pulse.zlb.phx

Added.

Leaving NI since I have some follow-up items to do.
(Reporter)

Comment 3

3 years ago
Looks like we'll be doing this outside of a TCW after all, so I'm going to resolve this.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Flags: needinfo?(mcote)
Flags: cab-review?
Resolution: --- → WONTFIX
Reviewed 9/30 CAB - Approved for 10/7
Flags: cab-review+

Updated

3 years ago
Cab Review: --- → approved
Flags: cab-review+