Closed Bug 712695 Opened 13 years ago Closed 13 years ago

brief HG outage 2011-12-21

Product/Component: Infrastructure & Operations :: RelOps: General
Type: task
Hardware/OS: x86, macOS
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: bear
Assignee: Unassigned
Whiteboard: [buildduty][outage]

At 0950 PST, hg (and also svn) became unresponsive.

dumitru responded, and both are back online.

from #sysadmins:

[12:49]  <nagios-sjc1> [66] dm-hg02:https - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:50]  <nagios-sjc1> [68] dm-hg02:https_cert - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:50]  <nagios-sjc1> [70] dm-hg02:http - hg.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:50]  <dumitru> what the
[12:50]  <nagios-sjc1> [74] dm-svn02:health is CRITICAL: CHECK_NRPE: Socket timeout after 30 seconds.
[12:50]  <dumitru> fuuuu
[12:50]  <nagios-sjc1> [76] dm-svn02:https - svn.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:51]  <nagios-sjc1> [78] dm-svn02:http - svn.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:58]  <nagios-sjc1> dm-svn02:https_cert - svn.mozilla.org is OK: OK - Certificate will expire on 11/23/2013 19:01.
[12:58]  <nagios-sjc1> dm-hg02:https - hg.mozilla.org is OK: HTTP OK: HTTP/1.1 200 Script output follows - 17348 bytes in 0.255 second response time
[12:59]  <nagios-sjc1> dm-hg02:http - hg.mozilla.org is OK: HTTP OK: HTTP/1.1 200 Script output follows - 17348 bytes in 0.115 second response time
[12:59]  <nagios-sjc1> dm-hg02:https_cert - hg.mozilla.org is OK: OK - Certificate will expire on 11/24/2013 07:50.
[12:59]  <nagios-sjc1> dm-svn02:https - svn.mozilla.org is OK: HTTP OK: HTTP/1.1 200 OK - 826 bytes in 0.084 second response time
[12:59]  <nagios-sjc1> dm-svn02:health is OK: OK - System: proliant dl360 g4, S/N: USM523048F, ROM: P52 04/14/2005, hardware working fine
[13:00]  <nagios-sjc1> dm-svn02:http - svn.mozilla.org is OK: HTTP OK: HTTP/1.1 200 OK - 826 bytes in 0.008 second response time
[13:01]  <nagios-sjc1> dm-svn02:avg load is OK: OK - load average: 0.86, 0.46, 0.18

While the hg server was fixed, hg syncing seems to be in the doldrums. I poked bkero about it and he is looking into it.

[14:17]  <nagios-sjc1> [90] hg1.build.scl1:Mercurial mirror sync - /releases/mozilla-aurora is CRITICAL: sync data is stale. 5602 seconds
[14:17]  <nagios-sjc1> [91] hg1.build.scl1:Mercurial mirror sync - /mozilla-central is CRITICAL: sync data is stale. 5600 seconds
[14:17]  <nagios-sjc1> [94] hg1.build.scl1:Mercurial mirror sync - /releases/mozilla-beta is CRITICAL: sync data is stale. 5621 seconds
[14:18]  <nagios-sjc1> [95] hg1.build.scl1:Mercurial mirror sync - /try is CRITICAL: sync data is stale. 5660 seconds
[14:18]  <nagios-sjc1> [96] hg1.build.scl1:Mercurial mirror sync - /hgcustom/hg_templates is CRITICAL: sync data is stale. 5660 seconds
[14:18]  <nagios-sjc1> [97] hg1.build.scl1:Mercurial mirror sync - /build/tools is CRITICAL: sync data is stale. 5680 seconds
[14:18]  <nagios-sjc1> [98] hg1.build.scl1:Mercurial mirror sync - /integration/mozilla-inbound is CRITICAL: sync data is stale. 5684 seconds
[14:18]  <nagios-sjc1> [99] hg1.build.scl1:Mercurial mirror sync - /hgcustom/hghooks is CRITICAL: sync data is stale. 5711 seconds
[14:18]  <nagios-sjc1> [00] hg1.build.scl1:Mercurial mirror sync - /build/buildbot-configs is CRITICAL: sync data is stale. 5711 seconds
We continued to get these:

Subject: ** PROBLEM alert - hg1.build.scl1/Mercurial mirror sync - /hgcustom/hghooks is CRITICAL **
Date: Wed, 21 Dec 2011 13:06:54 -0800 (PST)
From: nagios@dm-nagios01.mozilla.org (nagios)
***** Nagios  *****
Notification Type: PROBLEM
Service: Mercurial mirror sync - /hgcustom/hghooks
Host: hg1.build.scl1
Address: 10.12.51.200
State: CRITICAL
Date/Time: 12-21-2011 13:06:54
Additional Info:
sync data is stale. 12191 seconds

 
...and we just now got a Nagios recovery alert:

-------- Original Message --------
Subject: ** RECOVERY alert - hg1.build.scl1/Mercurial mirror sync - /hgcustom/hg_templates is OK **
Date: Wed, 21 Dec 2011 13:18:02 -0800 (PST)
From: nagios@dm-nagios01.mozilla.org (nagios)
***** Nagios  *****
Notification Type: RECOVERY
Service: Mercurial mirror sync - /hgcustom/hg_templates
Host: hg1.build.scl1
Address: 10.12.51.200
State: OK
Date/Time: 12-21-2011 13:18:02
Additional Info:
SYNC OK

I fixed those. The monitoring script had stalled, with no actual impact: the mirrors were up to date, and releng systems are resilient to out-of-date mirrors anyway.

Should we make those alert less often?

Aside from that, is there anything left in this bug?
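
For context on what a "sync data is stale. N seconds" alert means, here is a minimal sketch of a Nagios-style staleness check. It is not the script actually running against hg1.build.scl1; the function name, the thresholds, and the WARNING message text are assumptions made for illustration. Only the CRITICAL and OK message strings and the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL) are taken as given.

```python
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_sync_age(last_sync_epoch: float,
                   now_epoch: float,
                   warn_after: float = 1800.0,    # hypothetical threshold
                   crit_after: float = 3600.0     # hypothetical threshold
                   ) -> Tuple[int, str]:
    """Return (exit_code, message) based on the age of the last recorded sync."""
    age = int(now_epoch - last_sync_epoch)
    if age >= crit_after:
        # Message format as seen in the alerts in this bug.
        return CRITICAL, f"sync data is stale. {age} seconds"
    if age >= warn_after:
        # Hypothetical WARNING text; the real check may not have this state.
        return WARNING, f"sync data is aging. {age} seconds"
    return OK, "SYNC OK"


if __name__ == "__main__":
    import sys
    import time

    # Hypothetical input: pretend the last sync was recorded 5602 seconds ago.
    code, message = check_sync_age(time.time() - 5602, time.time())
    print(message)
    sys.exit(code)
```

A check like this only sees the recorded timestamp. If whatever records that timestamp stalls, the check reports CRITICAL even when the mirrors themselves are current, which would be consistent with what was described above.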
I don't think the alerts are an issue if we have the means to fix it :)

nothing left that I'm aware of in this context
Awesome.  For the record, it's something oncall can fix if you can't raise anyone from relops.  We're still hunting the bug that periodically causes the mirror-sync errors (which is completely unrelated to the bug we saw here).
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [buildduty][outage]
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations