Closed Bug 894085 Opened 11 years ago Closed 11 years ago

addons.mozilla.org and marketplace.firefox.com production hardware migration

Categories

(Infrastructure & Operations :: Change Requests, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jason, Assigned: jason)

Details

(Whiteboard: July 23rd, 2013, 16:00 PDT, 4 hours)

* date, time, duration of maintenance 
July 17th, 2013, 10:00 AM PDT, 4 hours

* system(s) affected
Systems associated with:
addons.mozilla.org
services.addons.mozilla.org
marketplace.firefox.com

* end-user impact 
addons.mozilla.org will be in a read-only state. Users will not be able to add or modify addons.mozilla.org content.
Marketplace will be unavailable, and users will be redirected to hardhat.mozilla.net.

* maintenance plan and timeline
https://infra.etherpad.mozilla.org/marketplace-cutover-20130717

* rollback plan / rollback point
https://infra.etherpad.mozilla.org/marketplace-cutover-20130717

* notification mechanisms
Email to everyone@mozilla.org.

* who will be point, who else will be involved 
Jason Thomas and Jeremy Orem
Flags: cab-review?
Can we do this around 4pm or so PDT?  That would get us to midnight in Europe which is where most of the marketplace users are
Folks - I understand and support the need to move to the dedicated new HA hardware.  But, I'm expecting this is the LAST TIME that we take this cluster down for maintenance as the users are starting to ramp (phones are selling) and we are moving to an environment where we are online all the time.

Thanks for making this happen.

Rick.
Is this an emergency move?  I see the request is for a move on the 17th, the CAB meets to approve this on the 17th at 9:00am, and we ask for 2 weeks notice on a change[1].  Given that this is a new process and we are ironing out the kinks I am happy to make an exception, but even so same-day approval after a CAB meeting might be tough if there are exceptions, not to mention no time to make appropriate notifications as this window is less than 48 hours from now.

[1] - https://wiki.mozilla.org/IT/Maintenance#Approval_Process
Bug 888989 needs done in this window as well to add redundancy.  We were assuming that object storage would be on S3 after this move so we haven't pushed for this previously, but that not being the case it needs done during the downtime.   Thanks!
(In reply to Corey Shields [:cshields] from comment #4)
> Bug 888989 needs done in this window as well to add redundancy.  We were
> assuming that object storage would be on S3 after this move so we haven't
> pushed for this previously, but that not being the case it needs done during
> the downtime.   Thanks!

We have been working with gcox and will move to the new volume during the migration.

(In reply to Corey Shields [:cshields] from comment #3)
> Is this an emergency move?  I see the request is for a move on the 17th, the
> CAB meets to approve this on the 17th at 9:00am, and we ask for 2 weeks
> notice on a change[1].  Given that this is a new process and we are ironing
> out the kinks I am happy to make an exception, but even so same-day approval
> after a CAB meeting might be tough if there are exceptions, not to mention
> no time to make appropriate notifications as this window is less than 48
> hours from now.
> 
> [1] - https://wiki.mozilla.org/IT/Maintenance#Approval_Process

Sorry about this, we found out we needed to file a CAB on 7/9 and we need to hit a rushed date to get everything moved over before new markets come online.
Cool, emailing the CAB members to bring this to their attention now, sooner than the meeting.
This bug mentions "hardware migration". I'm sorry that I can't gauge more context from this, but will someone fill me in? Are you moving from VMs to hardware? AWS to hardware? Old hardware to "dedicated new HA hardware"? 

Thanks.
Old hardware to new hardware.
(In reply to Wil Clouser [:clouserw] from comment #1)
> Can we do this around 4pm or so PDT?  That would get us to midnight in
> Europe which is where most of the marketplace users are

:oremj, any objections to starting at 4pm per Wil's suggestion?
(In reply to Mark Mayo [:mmayo] from comment #9)
> :oremj, any objections to starting at 4pm per Wil's suggestion?

4pm works for us. We need to confirm the time with dbops and the storage team tomorrow.
dbeng can do 4 pm tomorrow.
Team 

Though I do not want to wait - I think going live is risky given the activities we have going on in production

- Ensuring receipt checking passes QA is vital (Krupa)
- We are currently stabilizing Poland in prod
- Colombia is going live this week in prod

I think next Tuesday after 4pm would be good - it's after Colombia payments go live, but before the Colombia go-to-market.

Other thoughts?
Tuesday @ 4pm works for us. Has Rick approved the time change?
Rick has delegated to Caitlin.  Just make it middle of the night in Europe.
Alright, it's settled. We'll execute the cutover plan 7/23 @ 4pm.
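For reference, a quick sanity check of what the 4-hour window starting 7/23 @ 4pm PDT looks like in European local time. This is only a sketch: the fixed summer-time offsets and the `local_window` helper are assumptions made for illustration, not part of the cutover plan.

```python
from datetime import datetime, timedelta, timezone

# Fixed summer-time offsets (assumed; these hold for late July 2013).
PDT = timezone(timedelta(hours=-7), "PDT")
BST = timezone(timedelta(hours=1), "BST")
CEST = timezone(timedelta(hours=2), "CEST")

# Maintenance window from the whiteboard: July 23rd, 2013, 16:00 PDT, 4 hours.
WINDOW_START = datetime(2013, 7, 23, 16, 0, tzinfo=PDT)
WINDOW_LENGTH = timedelta(hours=4)


def local_window(tz, start=WINDOW_START, length=WINDOW_LENGTH):
    """Return the (start, end) of the window expressed in the given zone."""
    return start.astimezone(tz), (start + length).astimezone(tz)


for tz in (BST, CEST):
    begin_local, end_local = local_window(tz)
    print(f"{tz.tzname(None)}: {begin_local:%Y-%m-%d %H:%M} "
          f"to {end_local:%Y-%m-%d %H:%M}")
```

Under these assumptions the window opens at midnight in the UK and 01:00 in central Europe on 7/24, which matches the "middle of the night in Europe" request.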
Whiteboard: July 23rd, 2013, 16:00 PDT, 4 hours
Flags: cab-review? → cab-review+
This was completed yesterday.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
(In reply to Jason Thomas [:jason] from comment #16)
> This was completed yesterday.

There were some misses on the BI/DW side: the intake of data from add-ons to Hadoop. Re-opening so that Daniel can confirm all is done and the runbooks are updated to include this tie to data warehousing.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Daniel, has this been fixed by now?
We are getting what appears to be 100% of the traffic, however, there is still an open issue with netscaler not supporting the inclusion of the DNT header value in the access logs.  That also impacts the inclusion of B2G device details in marketplace logs, but we have a workaround in place for those where we are collecting logs directly from the webheads rather than netscaler.

I think it would also be useful to have a nagios alert of some sort to detect a drop in log volume from the netscaler log collection daemon.  That might be something that already exists.
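A rough sketch of what such a check might look like, in case nothing like it exists yet. It follows the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name, the thresholds, and the idea of comparing a current line count against a baseline count are all assumptions for illustration. How the counts would be gathered from the log collection daemon is not covered here.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_log_volume(current, baseline, warn_ratio=0.5, crit_ratio=0.2):
    """Compare the current log line count against a baseline count.

    Returns an (exit_code, message) pair. The default ratios are
    hypothetical thresholds and would need tuning against real traffic.
    """
    if baseline <= 0:
        # Without a baseline there is nothing to compare against.
        return UNKNOWN, "UNKNOWN - no baseline log volume available"
    ratio = current / baseline
    detail = f"{current} lines vs baseline {baseline} ({ratio:.0%})"
    if ratio < crit_ratio:
        return CRITICAL, f"CRITICAL - {detail}"
    if ratio < warn_ratio:
        return WARNING, f"WARNING - {detail}"
    return OK, f"OK - {detail}"


# Example: 400 lines in the last interval against a typical 1000.
code, message = evaluate_log_volume(400, 1000)
print(code, message)
```

A wrapper script would print the message and exit with the returned code so Nagios can pick up the state. Comparing against a baseline rather than a fixed count is a design choice, on the assumption that traffic varies by time of day.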
Anurag, Annie - where are we with closing this one?
SylvieV - 

addons.mozilla.org:
We are getting 100% logs for AMO, header information is still missing. Jason is working with Citrix in terms of getting a software update to start collecting the headers for AMO for DNT data and will update us when the patch is in place.

marketplace.firefox.com:
This is done. Logs for m.f.c are being collected via nginx w/ header support and pushed to our filers every night.
(In reply to Anurag Phadke[:aphadke@mozilla.com] from comment #21)
> addons.mozilla.org:
> We are getting 100% logs for AMO, header information is still missing. Jason
> is working with Citrix in terms of getting a software update to start
> collecting the headers for AMO for DNT data and will update us when the
> patch is in place.

We are waiting on Netscaler firmware 11.0 to patch issues with stability in Netscaler 10.1 (bug 900984). Citrix has not provided a final release date but support has stated that it might be out within the next two months. Bug 897732 opened to track progress.
Netscaler firmware was upgraded to latest stable release (bug 929110) which included support for custom headers. DNT headers have been configured for addons.mozilla.org netscaler logs.
Assignee: server-ops → jthomas
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
Change Request: --- → approved
Flags: cab-review+