Closed Bug 891128 Opened 11 years ago Closed 11 years ago

Multiple services are offline in SCL3

Categories

(Infrastructure & Operations Graveyard :: NetOps: Other, task)

All
Other
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fox2mike, Unassigned)

References

Details

(Whiteboard: [reit-ops][reit-b2g])

Still ongoing, working to resolve
Assignee: server-ops → shyam
Group: infra → mozilla-corporation-confidential
Summary: Multiple services outage in SCL3 → Multiple services are offline in SCL3
List of affected services:

1) Bugzilla (should be back online now)
2) hg.mozilla.org
3) git.mozilla.org
Group: mozilla-corporation-confidential
Whiteboard: [reit-ops][reit-b2g]
Done on bugzilla4 and the backups server and bugzilla1.db.phx1:

slave stop;
change master to master_host='bugzilla2.db.scl3.mozilla.com', master_log_pos=921472505, master_log_file='bugzilla2-bin.000322';
slave start;

This also needs to be done on bugzilla3 when it comes up. (bugzilla1 is already replicating from bugzilla2.)
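
Since the same three statements have to be repeated on each remaining replica, a small helper can generate them so the coordinates are not retyped by hand. This is only a sketch: the function name repoint_statements is hypothetical, it only builds the text of the statements quoted above, and it does not connect to any server.

```python
# Hypothetical helper (not part of any Mozilla tooling): builds the same
# three statements that were run by hand above, for pasting into a mysql
# session on each replica.
def repoint_statements(master_host, log_file, log_pos):
    """Return, in order, the statements that repoint one replica."""
    if not isinstance(log_pos, int) or log_pos < 0:
        raise ValueError("log_pos must be a non-negative integer")
    for value in (master_host, log_file):
        # Refuse anything that would break out of the quoted SQL literal.
        if "'" in value or "\\" in value:
            raise ValueError("unexpected quote or backslash in %r" % value)
    return [
        "slave stop;",
        "change master to master_host='%s', master_log_pos=%d, "
        "master_log_file='%s';" % (master_host, log_pos, log_file),
        "slave start;",
    ]


if __name__ == "__main__":
    # Coordinates taken from the comment above.
    for stmt in repoint_statements(
        "bugzilla2.db.scl3.mozilla.com", "bugzilla2-bin.000322", 921472505
    ):
        print(stmt)
```

The position is passed as an integer and the log file as a string so that a swapped argument fails loudly instead of producing a plausible-looking statement.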

bugzilla1.db.phx1 can't replicate because of netflows (it cannot reach bugzilla2.db.scl3 on port 3306):

[root@bugzilla1.db.phx1 ~]# nc -vz bugzilla2.db.scl3.mozilla.com 3306
nc: connect to bugzilla2.db.scl3.mozilla.com port 3306 (tcp) failed: Connection timed out
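
For checking the same flow from several hosts or from a monitoring script, the nc test above can be approximated in Python. This is a rough equivalent only: the function name can_connect and the 5 second default timeout are assumptions, and like nc -z it only tests that a TCP connection opens, not that MySQL replication would actually work.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False


# Example call, matching the check above (not executed here):
#   can_connect("bugzilla2.db.scl3.mozilla.com", 3306)
```

A False result does not distinguish a dropped flow (timeout) from a refused connection; nc's own error message, as shown above, is more specific on that point.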
Whiteboard: [reit-ops][reit-b2g]
Assignee: shyam → ravi
Severity: blocker → normal
Component: Server Operations → NetOps: Other
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → ravi
Switched back to bugzilla1.db.scl3 as the master; if this write succeeds then everything is OK again.
trees closed: 14:47PT
trees opened: 18:16PT

The Firefox 23.0b4 release was impacted; those builds have been revived and are now in progress. Details of the cleanup are in bug 891165.

As before, let's keep this bug open for root cause/postmortem.
We have AJTAC and BNG case 2013-0708-1012 open with Juniper and a call this morning to sync up on the issue. We believe this may be similar to bug 826609, and that we were able to collect full debug output and logs to help AJTAC identify a root cause.
Status: NEW → ASSIGNED
QA Contact: ravi → adam
Assignee: ravi → network-operations
Juniper's case remains open. However, as they have not been able to track down a root cause of the issue for the past year, we are electing to remove the technology. The depending bug is the tracker for that work. Previously, the netops roadmap had us installing this device in phx1 as well; that is no longer on the table due to the instability we've experienced in scl3, which has a similar layout.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard