Closed
Bug 712695
Opened 13 years ago
Closed 13 years ago
brief HG outage 2011-12-21
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bear, Unassigned)
Details
(Whiteboard: [buildduty][outage])
At 0950 PST, hg (and also svn) were unresponsive. dumitru responded and they are back online.

From #sysadmins:

[12:49] <nagios-sjc1> [66] dm-hg02:https - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:50] <nagios-sjc1> [68] dm-hg02:https_cert - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:50] <nagios-sjc1> [70] dm-hg02:http - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:50] <dumitru> what the
[12:50] <nagios-sjc1> [74] dm-svn02:health is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[12:50] <dumitru> fuuuu
[12:50] <nagios-sjc1> [76] dm-svn02:https - svn.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:51] <nagios-sjc1> [78] dm-svn02:http - svn.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:58] <nagios-sjc1> dm-svn02:https_cert - svn.mozilla.org is OK: OK - Certificate will expire on 11/23/2013 19:01.
[12:58] <nagios-sjc1> dm-hg02:https - hg.mozilla.org is OK: HTTP OK: HTTP/1.1 200 Script output follows - 17348 bytes in 0.255 second response time
[12:59] <nagios-sjc1> dm-hg02:http - hg.mozilla.org is OK: HTTP OK: HTTP/1.1 200 Script output follows - 17348 bytes in 0.115 second response time
[12:59] <nagios-sjc1> dm-hg02:https_cert - hg.mozilla.org is OK: OK - Certificate will expire on 11/24/2013 07:50.
[12:59] <nagios-sjc1> dm-svn02:https - svn.mozilla.org is OK: HTTP OK: HTTP/1.1 200 OK - 826 bytes in 0.084 second response time
[12:59] <nagios-sjc1> dm-svn02:health is OK: OK - System: proliant dl360 g4, S/N: USM523048F, ROM: P52 04/14/2005, hardware working fine
[13:00] <nagios-sjc1> dm-svn02:http - svn.mozilla.org is OK: HTTP OK: HTTP/1.1 200 OK - 826 bytes in 0.008 second response time
[13:01] <nagios-sjc1> dm-svn02:avg load is OK: OK - load average: 0.86, 0.46, 0.18
Reporter
Comment 1•13 years ago
While the hg server was fixed, hg syncing seems to be in the doldrums. I poked bkero about it and he is looking into it.

[14:17] <nagios-sjc1> [90] hg1.build.scl1:Mercurial mirror sync - /releases/mozilla-aurora is CRITICAL: sync data is stale. 5602 seconds
[14:17] <nagios-sjc1> [91] hg1.build.scl1:Mercurial mirror sync - /mozilla-central is CRITICAL: sync data is stale. 5600 seconds
[14:17] <nagios-sjc1> [94] hg1.build.scl1:Mercurial mirror sync - /releases/mozilla-beta is CRITICAL: sync data is stale. 5621 seconds
[14:18] <nagios-sjc1> [95] hg1.build.scl1:Mercurial mirror sync - /try is CRITICAL: sync data is stale. 5660 seconds
[14:18] <nagios-sjc1> [96] hg1.build.scl1:Mercurial mirror sync - /hgcustom/hg_templates is CRITICAL: sync data is stale. 5660 seconds
[14:18] <nagios-sjc1> [97] hg1.build.scl1:Mercurial mirror sync - /build/tools is CRITICAL: sync data is stale. 5680 seconds
[14:18] <nagios-sjc1> [98] hg1.build.scl1:Mercurial mirror sync - /integration/mozilla-inbound is CRITICAL: sync data is stale. 5684 seconds
[14:18] <nagios-sjc1> [99] hg1.build.scl1:Mercurial mirror sync - /hgcustom/hghooks is CRITICAL: sync data is stale. 5711 seconds
[14:18] <nagios-sjc1> [00] hg1.build.scl1:Mercurial mirror sync - /build/buildbot-configs is CRITICAL: sync data is stale. 5711 seconds
Comment 2•13 years ago
We continued to get these:

Subject: ** PROBLEM alert - hg1.build.scl1/Mercurial mirror sync - /hgcustom/hghooks is CRITICAL **
Date: Wed, 21 Dec 2011 13:06:54 -0800 (PST)
From: nagios@dm-nagios01.mozilla.org (nagios)

***** Nagios *****

Notification Type: PROBLEM
Service: Mercurial mirror sync - /hgcustom/hghooks
Host: hg1.build.scl1
Address: 10.12.51.200
State: CRITICAL
Date/Time: 12-21-2011 13:06:54
Additional Info: sync data is stale. 12191 seconds

....and we just now got a nagios recovery alert

-------- Original Message --------
Subject: ** RECOVERY alert - hg1.build.scl1/Mercurial mirror sync - /hgcustom/hg_templates is OK **
Date: Wed, 21 Dec 2011 13:18:02 -0800 (PST)
From: nagios@dm-nagios01.mozilla.org (nagios)

***** Nagios *****

Notification Type: RECOVERY
Service: Mercurial mirror sync - /hgcustom/hg_templates
Host: hg1.build.scl1
Address: 10.12.51.200
State: OK
Date/Time: 12-21-2011 13:18:02
Additional Info: SYNC OK
Comment 3•13 years ago
I fixed those - the monitoring script had stalled, with no actual impact (mirrors were up to date, although releng systems are resilient to out-of-date mirrors anyway). Should we make those alert less often? Aside from that, is there anything left in this bug?
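For readers unfamiliar with this kind of alert, here is a minimal sketch of a staleness check, not the script actually used on hg1.build.scl1. It assumes the check compares the age of some "last successful sync" timestamp against thresholds; the marker-file approach, the function names, and the threshold values are all hypothetical. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention. A check of this shape would plausibly report CRITICAL whenever whatever refreshes the timestamp stalls, even if the mirrors themselves are current, which would be consistent with what is described above.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_staleness(age_seconds, warn_after=1800, crit_after=3600):
    """Map the age of the last recorded sync to a Nagios status and message.

    warn_after and crit_after are illustrative thresholds, not the real ones.
    """
    if age_seconds is None:
        return UNKNOWN, "no sync timestamp found"
    if age_seconds >= crit_after:
        return CRITICAL, "sync data is stale. %d seconds" % age_seconds
    if age_seconds >= warn_after:
        return WARNING, "sync data is aging. %d seconds" % age_seconds
    return OK, "SYNC OK"


def file_age(path, now=None):
    """Return seconds since `path` was last modified, or None if it is missing.

    `path` stands for a hypothetical marker file touched after each sync.
    """
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return None
    return (time.time() if now is None else now) - mtime
```

A wrapper script would call `check_staleness(file_age(marker_path))`, print the message, and exit with the returned status code. Raising `crit_after`, or re-notifying less frequently on the Nagios side, are two ways the alerts could be made less noisy.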
Reporter
Comment 4•13 years ago
I don't think the alerts are an issue if we have the means to fix them :) Nothing left that I'm aware of in this context.
Comment 5•13 years ago
Awesome. For the record, it's something oncall can fix if you can't raise anyone from relops. We're still hunting the bug that periodically causes the mirror-sync errors (which is completely unrelated to the bug we saw here).
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Reporter
Updated•13 years ago
Whiteboard: [buildduty][outage]
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations