Closed Bug 796991 Opened 12 years ago Closed 12 years ago

Perma-red talos with: "FAIL: Graph server unreachable (5 attempts) ... send failed, graph server says: ... Service Unavailable"

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: dustin)

Details

I don't think we've changed anything on the releng side here. Anything funky with the DB/web heads?
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: arich
utils.talosError: 'Graph server unreachable (5 attempts)\nsend failed, graph server says:\n<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html;charset=utf-8">\n<title>Service Unavailable</title>\n<style type="text/css">\nbody, p, h1 {\n  font-family: Verdana, Arial, Helvetica, sans-serif;\n}\nh2 {\n  font-family: Arial, Helvetica, sans-serif;\n  color: #b10b29;\n}\n</style>\n</head>\n<body>\n<h2>Service Unavailable</h2>\n<p>The service is temporarily unavailable. Please try again later.</p>\n</body>\n</html>\n'

boo :/
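The "(5 attempts)" in that message suggests the client retries a fixed number of times before giving up. Here is a minimal sketch of that kind of bounded retry. It is not Talos's actual code: `post_with_retries`, `GraphServerUnreachable`, and the `send` callable are hypothetical names, and the assumption that only HTTP 200 counts as success is mine.

```python
import time


class GraphServerUnreachable(Exception):
    """Hypothetical error type raised once every attempt has failed."""


def post_with_retries(send, payload, attempts=5, delay=0.0):
    """Call send(payload) up to `attempts` times.

    `send` is a stand-in for the real upload call and is assumed to
    return a (status_code, body) tuple.
    """
    last_body = ""
    for attempt in range(1, attempts + 1):
        status, body = send(payload)
        if status == 200:
            return body
        last_body = body
        if attempt < attempts:
            # Pause between attempts, but not after the final one.
            time.sleep(delay)
    raise GraphServerUnreachable(
        "Graph server unreachable (%d attempts)\n"
        "send failed, graph server says:\n%s" % (attempts, last_body)
    )
```

With a `send` that keeps returning 503, this raises after the fifth call and carries the last response body, which is roughly the shape of the error pasted above.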
Assignee: server-ops-releng → dustin
Looks ok right now, but I see several exceptions like this:

[Tue Oct 02 09:42:17 2012] [error] unable to insert new record into 'test_run_values': (1062, "Duplicate entry '19482821-0' for key 'PRIMARY'")
[Tue Oct 02 09:42:17 2012] [error]   File "/var/www/html/graphs/server/pyfomatic/collect.py", line 273, in handleRequest
[Tue Oct 02 09:42:17 2012] [error]     average = valuesReader(databaseCursor, databaseModule, inputStream, metadata)
[Tue Oct 02 09:42:17 2012] [error]   File "/var/www/html/graphs/server/pyfomatic/collect.py", line 210, in valuesReader
[Tue Oct 02 09:42:17 2012] [error]     raise DatabaseException("unable to insert new record into 'test_run_values': %s" % str(x))
Earlier than that there were a ton of:

[Tue Oct 02 09:37:19 2012] [error] (2006, 'MySQL server has gone away')

starting at 09:28:04
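To separate the two kinds of failure in the error log (2006 connection drops versus 1062 duplicate keys), something like the following could tally them by MySQL error code. This is only a sketch: `count_mysql_errors` is a hypothetical helper, and it assumes each log entry sits on one line with the code appearing as `(NNNN, '...')` or `(NNNN, "...")`.

```python
import re
from collections import Counter

# Matches the "(code, 'message')" tuple that the MySQL driver errors print.
MYSQL_ERROR = re.compile(r"\((\d+),\s*['\"]")

SAMPLE_LINES = '''
[Tue Oct 02 09:37:19 2012] [error] (2006, 'MySQL server has gone away')
[Tue Oct 02 09:42:17 2012] [error] unable to insert new record into 'test_run_values': (1062, "Duplicate entry '19482821-0' for key 'PRIMARY'")
[Tue Oct 02 09:42:17 2012] [error]   File "/var/www/html/graphs/server/pyfomatic/collect.py", line 273, in handleRequest
'''.strip().splitlines()


def count_mysql_errors(lines):
    """Return a Counter mapping MySQL error code -> number of log lines."""
    counts = Counter()
    for line in lines:
        match = MYSQL_ERROR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    for code, n in sorted(count_mysql_errors(SAMPLE_LINES).items()):
        print(code, n)
```

Traceback lines carry no error tuple, so they are skipped rather than counted.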

It looks like the exceptions from comment 3 (logged right after the server became reachable again) all have the same timestamp, and things have been looking normal since then.
Seeing some green; trees reopened.
Severity: blocker → major
There were some replication errors earlier that got the auto_increment out of sync with the rows in the table. I'm surprised they didn't manifest until 9:42 (the replication errors were about an hour earlier). Anyway, the few exceptions rhelmer identified in comment 3 were the only problems, afaict, so we shouldn't see any more failures (and haven't for almost 45m now).

Bug 796936 is open (before this one!) to fix the auto_increment problem.
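For anyone wondering how a stale auto-increment counter turns into a "Duplicate entry" error, here is a toy reproduction. Everything in it is an assumption for illustration: it uses in-memory SQLite rather than MySQL, the two-table schema is guessed from the `'19482821-0'` key format, and the ids are made up. It is not the graph server's real schema.

```python
import sqlite3


def simulate_stale_counter():
    """Show a fresh id colliding with a row that already uses that id."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Hypothetical parent table; SQLite hands out max(id) + 1 for new rows.
    cur.execute("CREATE TABLE test_runs (id INTEGER PRIMARY KEY, note TEXT)")
    # Hypothetical child table keyed on (run id, interval), like '19482821-0'.
    cur.execute(
        "CREATE TABLE test_run_values ("
        " test_run_id INTEGER, interval_id INTEGER, value REAL,"
        " PRIMARY KEY (test_run_id, interval_id))"
    )
    cur.executemany(
        "INSERT INTO test_runs (id, note) VALUES (?, ?)",
        [(i, "ok") for i in range(1, 5)],
    )
    # Out-of-sync state: values already exist for run 5, but test_runs
    # only goes up to 4, so the next generated id will be 5.
    cur.execute("INSERT INTO test_run_values VALUES (5, 0, 1.0)")

    cur.execute("INSERT INTO test_runs (note) VALUES ('new run')")
    new_id = cur.lastrowid
    try:
        cur.execute("INSERT INTO test_run_values VALUES (?, 0, 2.0)", (new_id,))
        collided = False
    except sqlite3.IntegrityError:
        # SQLite's counterpart to MySQL error 1062 on the primary key.
        collided = True
    conn.close()
    return new_id, collided


if __name__ == "__main__":
    print(simulate_stale_counter())
```

The insert only fails for ids that already have leftover rows, which would fit with seeing just a handful of exceptions and then a return to normal once the counter moved past them.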
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Thank you :-)
Severity: major → blocker
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations