Closed Bug 551532 Opened 10 years ago Closed 10 years ago

Graph server crashed again

Categories

(mozilla.org Graveyard :: Server Operations, task, blocker)

x86
All
task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: mrz)

The graph server crashed and needed a reboot today. The same thing happened yesterday.

Marking as blocker, because this closes the tree each time.

The RAM was upgraded on Friday in bug 548371. The latest theory from IRC is slow disks, but that is unconfirmed. It is also unclear what changed to make this a problem now.
Assignee: server-ops → aravind
zandr reports in IRC that the clock on this VM is unable to keep up.

host-4-40:~ amilewski$ hostname ; date; ssh root@dm-graphs01 "hostname ; date" ; hostname ; date ; sleep 100 ; hostname ; date; ssh root@dm-graphs01 "hostname ; date" ; hostname ; date
host-4-40.mv.mozilla.com
Thu Mar 11 10:53:26 PST 2010
dm-graphs01.mozilla.org
Thu Mar 11 10:36:21 PST 2010
host-4-40.mv.mozilla.com
Thu Mar 11 11:02:34 PST 2010
host-4-40.mv.mozilla.com
Thu Mar 11 11:04:14 PST 2010
dm-graphs01.mozilla.org
Thu Mar 11 10:36:30 PST 2010
host-4-40.mv.mozilla.com
Thu Mar 11 11:19:15 PST 2010
host-4-40:~ amilewski$
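The transcript above can be turned into a rough bound on how far the guest clock is lagging. The following is a minimal sketch, not something that was run as part of this bug: the helper names (`parse`, `seconds_between`) and variable names are hypothetical, it assumes an English/C locale for the day and month names, and it drops the `PST` token because every reading is in the same zone. Each guest reading was taken at some point between the host readings printed immediately before and after it, so the host time elapsed between the two guest readings can only be bracketed, not pinned down exactly.

```python
from datetime import datetime


def parse(stamp):
    """Parse `date` output such as 'Thu Mar 11 10:53:26 PST 2010'."""
    parts = stamp.split()
    del parts[4]  # drop the zone token; all readings share the same zone
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


def seconds_between(start, end):
    """Whole seconds from one `date` reading to another."""
    return int((parse(end) - parse(start)).total_seconds())


# Readings copied from the transcript, in the order they were printed.
host_before_first_ssh = "Thu Mar 11 10:53:26 PST 2010"
guest_first = "Thu Mar 11 10:36:21 PST 2010"
host_after_first_ssh = "Thu Mar 11 11:02:34 PST 2010"
host_before_second_ssh = "Thu Mar 11 11:04:14 PST 2010"
guest_second = "Thu Mar 11 10:36:30 PST 2010"
host_after_second_ssh = "Thu Mar 11 11:19:15 PST 2010"

# Time that passed according to the guest between its two readings.
guest_elapsed = seconds_between(guest_first, guest_second)

# Host time between the two guest readings: at least the gap between the
# end of the first ssh and the start of the second, at most the full span.
min_host_elapsed = seconds_between(host_after_first_ssh, host_before_second_ssh)
max_host_elapsed = seconds_between(host_before_first_ssh, host_after_second_ssh)

print("guest clock advanced:", guest_elapsed, "s")
print("host clock advanced: between", min_host_elapsed, "and", max_host_elapsed, "s")
print("guest ran at no more than %.0f%% of real time"
      % (100.0 * guest_elapsed / min_host_elapsed))
```

Under these assumptions the guest clock advanced 9 seconds while somewhere between 100 and 1549 seconds passed on the host, which is consistent with the report that the VM clock cannot keep up.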
/data2 is a roughly 150GB drive that was on the SATA array. I've since moved it to an FCAL shelf. It appears to hold historical data.

[root@dm-graphs01 data2]# ls -la
total 36
drwxr-xr-x  5 root  root   4096 Oct 13 18:41 .
drwxr-xr-x 25 root  root   4096 Mar 11 16:34 ..
drwx------  2 root  root  16384 Oct 13 18:36 lost+found
drwxr-xr-x  5 mysql mysql  4096 Oct 13 21:13 mysql
drwxr-xr-x  2 mysql mysql  4096 Oct 13 18:56 mysql-innodb

Box is back up with 2 vCPUs and 4GB RAM. 

[root@dm-graphs01 data2]# date
Thu Mar 11 16:49:28 PST 2010
The machine is running fine, but note that the tree closure means there is no build load.

We're going to trigger a bunch of talos runs to generate load on graphserver and see if it holds up. If that works, we'll reopen the tree in approximately 40-60 minutes.
So far so good, tree handed back to developers.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Looks like the graph server is having those problems again. I have another "failed graph server post" on the SeaMonkey trees, and the Firefox tree has some of those as well. #bmo shows "<nagios> [96] dm-graphs01:http - graphs.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds" with no follow-up message about it recovering. When http://graphs.mozilla.org/ loads at all, it does so only after a very long lag, and it never reaches a point where it shows any data or graphs.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
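For anyone wanting to reproduce the symptom by hand, the following is a minimal sketch of an HTTP probe with a 10-second timeout, matching the timeout quoted in the nagios alert. It is not the nagios plugin itself: the function name `check_http`, its `fetch` parameter, and the exact message strings are assumptions made for illustration, and only the Python standard library is used.

```python
import socket
import urllib.error
import urllib.request


def check_http(url, timeout=10.0, fetch=urllib.request.urlopen):
    """Return a one-line status string for `url`.

    `fetch` is injectable (hypothetical parameter) so the function can be
    exercised without network access.
    """
    try:
        with fetch(url, timeout=timeout) as resp:
            status = resp.status
    except socket.timeout:
        # Timeout while reading the response.
        return "CRITICAL - Socket timeout after %d seconds" % timeout
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return "WARNING - HTTP %d" % exc.code
    except urllib.error.URLError as exc:
        # Timeouts during connect arrive wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return "CRITICAL - Socket timeout after %d seconds" % timeout
        return "CRITICAL - %s" % exc.reason
    return "OK - HTTP %d" % status


if __name__ == "__main__":
    print(check_http("http://graphs.mozilla.org/"))
```

A probe like this only shows whether the front page answers in time. It says nothing about whether the page actually contains data or graphs, which was the second half of the problem described above.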
catlee and justdave are setting up physical hardware to replace the current VM. This *should* be just a drop-in replacement, with no need to reconfigure talos masters/slaves, but stay tuned.
Reassigning to mrz, since he was last working on this (or maybe justdave is).
Assignee: aravind → mrz
New box is up and running as of this afternoon, now running on physical hardware.  The old box is still there and still hosting graphs-old and graphs-historical, but due to the external IP address getting re-mapped, it's now only accessible from behind the VPN.  If there's a need to have this data publicly available still (does anyone still use it?) we can make further arrangements.

The entire config from top to bottom is now in puppet, and set up in a way that would make it incredibly simple to add additional webheads to host this if we find ourselves running into load issues again.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard