Closed Bug 748814 Opened 13 years ago Closed 13 years ago

Tracking bug for Apr 25 2012 downtime

Categories

(Release Engineering :: General, defect)

Platform: x86_64 Linux
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: bear)

References

See Also: bug 748906

Details

(Whiteboard: [buildduty][downtime])

The Mozilla IT and RelEng teams need to take a downtime on Wednesday, April 25th to migrate some services that support the Firefox continuous integration system to a new data center. The user-facing systems are:

* build API (and builddata)
* clobberer
* OPSI configuration servers for the build farm
* trychooser

The IT team would also like to use the downtime to upgrade some systems and services to provide better performance and/or scalability, notably:

* re-balancing the ganeti cluster in our scl1 colo
* fixing DHCP in our scl1 colo
* reorganizing our minis and reconfiguring the switch to which they are attached in our mtv1 colo
* upgrading our rabbitmq installation
* deploying a new pair of databases for buildbot
* moving CVS to the new scl3 colo
* upgrading zimbra

The downtime is scheduled for 3 hours, starting at 09:00 PST. The trees will be closed during that time. We will open the trees and inform #developers as soon as possible after the maintenance is complete. As always, please let RelEng/myself know ASAP if there is any reason we should not proceed with this downtime.
Assignee: nobody → bear
Whiteboard: [buildduty][downtime]
Timeline from today's downtime:

0857 rail starts closing trees
0903 bear gives the all clear for IT to start
0923 arr finishes dhcp migration in scl1
0926 rail is stopping all build masters to allow db change
0927 arr had to reboot redis01
0928 redis01 up
0929 cruncher has been transitioned by dustin
0934 rabbitmq upgrade started by dustin
0936 confirmed all masters are down; config changes being made for db and relengweb01 update
0941 arr reports all ganeti moves are done
0948 mburns reports production-opsi migrated - networking changes in progress
0957 production-opsi up and running
1001 relengweb1 is cut over and ready for testing
1005 mtv1 minis back online
1032 sheeri reports db cutover done, waiting on catlee's confirmation
1037 dustin reports rabbitmq updated
1053 catlee reports db cutover tested ok
1100 exploring why multiple linux slaves are unable to connect to the scl1 puppet master
1115 arr rebooted puppet master scl1 and it's running, but clients are still timing out
1139 ravi testing firewall rollover
1150 non-scl3 masters are being started
1150 ravi bouncing releng.scl3 vpn
1156 schedulers db needed updating
1159 linux slaves in scl1 are still having puppetd issues - iptables "hack" is helping
1210 downtime done - two issues need post-downtime work on the releng side
1219 trees opened

Still need to file a post-downtime bug for the scl1 puppet problem.
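The bug doesn't record what the iptables "hack" actually was, and the root cause went to a follow-up bug. Purely as an illustration (not the workaround used here), below is a minimal connectivity-probe sketch in Python for triaging the 1100-1159 symptoms above: it distinguishes a refused connection (host up, service down) from a silent timeout (packets dropped, which usually points at a firewall or routing problem rather than the puppet master itself). The hostname is hypothetical and port 8140 is assumed to be the default puppet master port.

    import socket

    # Hypothetical hostname; the real scl1 puppet master name is not given in this bug.
    PUPPET_MASTER = "puppetmaster.scl1.example.com"
    PUPPET_PORT = 8140  # default puppet master port (an assumption here)

    def probe(host, port, timeout=5.0):
        # Attempt a plain TCP connect and classify the failure mode. A refused
        # connection means the host answered but nothing is listening; a timeout
        # with no answer usually means packets are being dropped in transit
        # (firewall/iptables or routing), matching the symptoms logged above.
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return "ok: connected"
        except socket.timeout:
            return "timeout: no response - packets likely dropped (firewall?)"
        except ConnectionRefusedError:
            return "refused: host reachable, service not listening"
        except OSError as exc:
            return "error: %s" % exc

    if __name__ == "__main__":
        print("%s:%d -> %s" % (PUPPET_MASTER, PUPPET_PORT,
                               probe(PUPPET_MASTER, PUPPET_PORT)))

Run from an affected slave, "refused" would exonerate the network path, while "timeout" would be consistent with a firewall rule eating traffic to the master.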
See Also: → 748906
All done here, trees are open.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering