Closed Bug 551532 Opened 10 years ago Closed 10 years ago
Graph server crashed again
The graph server crashed and needed a reboot today; the same thing happened yesterday. Marking as blocker, because this closes the tree each time. The RAM was upgraded Friday in bug 548371. The latest theory from IRC is slow disks, but that is unconfirmed. It is also unclear what has changed to make this a problem now.
zandr reports in IRC that the clock for this VM is unable to keep up:
host-4-40:~ amilewski$ hostname ; date; ssh root@dm-graphs01 "hostname ; date" ; hostname ; date ; sleep 100 ; hostname ; date; ssh root@dm-graphs01 "hostname ; date" ; hostname ; date
host-4-40.mv.mozilla.com
Thu Mar 11 10:53:26 PST 2010
dm-graphs01.mozilla.org
Thu Mar 11 10:36:21 PST 2010
host-4-40.mv.mozilla.com
Thu Mar 11 11:02:34 PST 2010
host-4-40.mv.mozilla.com
Thu Mar 11 11:04:14 PST 2010
dm-graphs01.mozilla.org
Thu Mar 11 10:36:30 PST 2010
host-4-40.mv.mozilla.com
Thu Mar 11 11:19:15 PST 2010
host-4-40:~ amilewski$
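As a rough reading of the transcript above, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not something that was run as part of the bug. It assumes the clock on host-4-40 is accurate and that each dm-graphs01 reading was taken at some point between the host-4-40 readings printed immediately before and after it. All variable names are made up for the example.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def t(s):
    """Parse a wall-clock time from the transcript (all on Thu Mar 11 2010, PST)."""
    return datetime.strptime(s, FMT)


# Timestamps copied from the transcript.
local_before_1 = t("10:53:26")  # host-4-40, just before the first ssh
guest_1 = t("10:36:21")         # dm-graphs01, first reading
local_after_1 = t("11:02:34")   # host-4-40, just after the first ssh
local_before_2 = t("11:04:14")  # host-4-40, after sleep 100
guest_2 = t("10:36:30")         # dm-graphs01, second reading
local_after_2 = t("11:19:15")   # host-4-40, just after the second ssh

# How far the guest clock advanced between its two readings.
guest_elapsed = (guest_2 - guest_1).total_seconds()

# Real time between the two guest readings is only known within bounds
# (assumption: the host-4-40 clock is correct).
min_real_elapsed = (local_before_2 - local_after_1).total_seconds()
max_real_elapsed = (local_after_2 - local_before_1).total_seconds()

# Fraction of real time the guest clock kept up with, best and worst case.
max_rate = guest_elapsed / min_real_elapsed
min_rate = guest_elapsed / max_real_elapsed

print(f"guest clock advanced {guest_elapsed:.0f}s "
      f"over {min_real_elapsed:.0f}s to {max_real_elapsed:.0f}s of real time")
print(f"guest clock rate: {min_rate:.1%} to {max_rate:.1%} of real time")
```

Under those assumptions, the guest clock advanced 9 seconds while at least 100 seconds passed on host-4-40, so it was running at no more than about 9% of real time. Each ssh round trip also took several minutes on its own.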
/data2 is a roughly 150GB drive that was on the SATA array. I've since moved it to an FCAL shelf. It appears to hold historical data.
[root@dm-graphs01 data2]# ls -la
total 36
drwxr-xr-x 5 root root 4096 Oct 13 18:41 .
drwxr-xr-x 25 root root 4096 Mar 11 16:34 ..
drwx------ 2 root root 16384 Oct 13 18:36 lost+found
drwxr-xr-x 5 mysql mysql 4096 Oct 13 21:13 mysql
drwxr-xr-x 2 mysql mysql 4096 Oct 13 18:56 mysql-innodb
Box is back up with 2 vCPUs and 4GB RAM.
[root@dm-graphs01 data2]# date
Thu Mar 11 16:49:28 PST 2010
The machine is running fine, but note that the tree closure means there is no build load. We're going to trigger a bunch of talos runs to generate load on the graph server and see if it holds up. If that works, we'll reopen the tree approximately 40-60 minutes from now.
So far so good, tree handed back to developers.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Looks like the graph server is having those problems again. I have another "failed graph server post" on the SeaMonkey trees, and the Firefox tree has some of those too. #bmo shows "<nagios>  dm-graphs01:http - graphs.mozilla.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds" with no follow-up message about it coming back. When http://graphs.mozilla.org/ loads at all, it does so after a very long lag and never gets to a point where it has any data or graphs in it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
catlee and justdave are setting up physical hardware to replace the current VM. This *should* be just a drop-in replacement, with no need to reconfigure talos masters/slaves, but stay tuned.
Reassigning to mrz, since he was last working on this (or maybe justdave is).
Assignee: aravind → mrz
New box is up and running as of this afternoon, now running on physical hardware. The old box is still there and still hosting graphs-old and graphs-historical, but due to the external IP address getting re-mapped, it's now only accessible from behind the VPN. If there's a need to have this data publicly available still (does anyone still use it?) we can make further arrangements. The entire config from top to bottom is now in puppet, and set up in a way that would make it incredibly simple to add additional webheads to host this if we find ourselves running into load issues again.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard