Bug 741628 - Processors locking up, running out of db connections
Status: RESOLVED WORKSFORME (Closed)
Opened: 13 years ago
Closed: 13 years ago
Categories: Data & BI Services Team :: DB: MySQL, task
Tracking: Not tracked
People: Reporter: lonnen; Assigned: mpressman
Description
Processors have been locking up throughout the morning. Digging into the logs shows a couple of different errors:
"CRITICAL - MainThread - server failure in db transaction - retry in 300 seconds"
"waiting for retry after failure in crash storage"
http://pastebin.mozilla.org/1551047
Restarting the processors offers only temporary relief.
Jberkus dug into PG Bouncer logs and found thousands of the following:
"2012-04-02 15:08:03.593 12057 LOG C-0x1fde410: (nodb)/(nouser)@10.8.70.200:47434 closing because: client unexpected eof (age=0)"
Restarting pgbouncer and then manually resetting each processor has not solved anything. The EOF errors keep coming in. It looks like the problems started around 20:00 UTC.
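For illustration, the volume of those EOF disconnects can be bucketed per minute to confirm when the storm started. A minimal Python sketch, assuming pgbouncer logs to /var/log/pgbouncer/pgbouncer.log and uses the timestamp prefix shown in the quoted line (both are assumptions, not taken from this bug):

from collections import Counter

# Assumed log location; the "YYYY-MM-DD HH:MM" prefix of each line is used
# as the per-minute bucket key.
LOGFILE = "/var/log/pgbouncer/pgbouncer.log"

eof_per_minute = Counter()
with open(LOGFILE) as fh:
    for line in fh:
        if "closing because: client unexpected eof" in line:
            eof_per_minute[line[:16]] += 1

for minute, count in sorted(eof_per_minute.items()):
    print(minute, count)

A sharp jump in the per-minute counts around 20:00 UTC would line up with the start of the lockups described above.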
Comment 1•13 years ago
I'm going to kick this over to IT for help. IT folks: if you have any suggestions (network problems? pgbouncer problems?), great. The timing doesn't suggest a code problem, but if you don't find anything, you can kick it back to us.
Assignee: nobody → server-ops
Severity: normal → critical
Component: Infra → Server Operations
Product: Socorro → mozilla.org
QA Contact: infra → phong
Version: unspecified → other
Comment 2•13 years ago
We recently changed the bouncer db password. Could this be related?
Comment 3•13 years ago
Shouldn't be, or it would be constant, not intermittent.
mpressman, you just started bug 731011, right? So that shouldn't be it either.
Assignee
Comment 4•13 years ago
Laura, I just finished, and there are an awful lot of connections from the processors after restarting them.
Assignee
Comment 5•13 years ago
The number of connections from the processors has now gone back down to a more normal level.
Assignee
Comment 6•13 years ago
Unfortunately, the pgbouncer logs don't go back that far, but having just witnessed what I believe to be the same activity, I can see the spurious "got packet 'E' from server when not linked" message in the logs, along with the following output:
2012-04-02 19:25:38.602 14103 WARNING C-0x1299f30: breakpad/processor@10.8.70.200:34035 Pooler Error: server conn crashed?
2012-04-02 19:25:39.399 14103 WARNING C-0x1288840: breakpad/processor@10.8.70.200:9175 Pooler Error: no working server connection
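When pooler errors like these are firing, pgbouncer's own view of the pool can be inspected via its admin console (SHOW POOLS / SHOW SERVERS). A minimal sketch, assuming pgbouncer listens on 127.0.0.1:6432 and the connecting user is listed in its admin_users or stats_users (both assumptions; the console does not support transactions, hence autocommit):

import psycopg2

# Assumed pgbouncer address and console user; adjust (and add a password)
# to match the local pgbouncer.ini.
conn = psycopg2.connect(host="127.0.0.1", port=6432,
                        dbname="pgbouncer", user="pgbouncer")
conn.autocommit = True   # the pgbouncer console rejects BEGIN/COMMIT
cur = conn.cursor()
for stmt in ("SHOW POOLS", "SHOW SERVERS"):
    cur.execute(stmt)
    print(stmt, [d[0] for d in cur.description])
    for row in cur.fetchall():
        print(row)
cur.close()
conn.close()

The cl_waiting and sv_active columns in SHOW POOLS are the quickest way to see whether the breakpad pool is saturated or whether server connections are dying underneath it.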
Comment 7•13 years ago
Is that error from Postgres? I don't think we've ever seen that one before - Lars may know more, but I think he's stranded in a flood. CCing rhelmer as well, in case he has any ideas.
Comment 8•13 years ago
I believe the error :mpressman refers to is from pgbouncer, not Postgres.
Updated•13 years ago
Assignee: server-ops → ashish
Comment 9•13 years ago
Looking through the processor logs on a few hosts, the last time they errored out connecting to the db was at 2012-04-02 19:32, which corroborates #c4. They seem to have been running fine since then.
(grep CRITICAL /var/log/socorro/socorro-processor.log on sp-processor08.phx1.mozilla.com):
2012-04-02 19:29:19,422 CRITICAL - Thread-5 - connection already closed
2012-04-02 19:29:19,422 CRITICAL - Thread-5 - trace back follows:
2012-04-02 19:29:19,423 CRITICAL - Thread-5 - Traceback (most recent call last):
2012-04-02 19:29:19,424 CRITICAL - Thread-5 - File "/data/socorro/application/socorro/processor/processor.py", line 491, in processJob
2012-04-02 19:29:19,424 CRITICAL - Thread-5 - File "/data/socorro/application/socorro/database/database.py", line 195, in connectionCursorPair
2012-04-02 19:29:19,425 CRITICAL - Thread-5 - InterfaceError: connection already closed
2012-04-02 19:29:19,425 CRITICAL - Thread-5 - major failure in crash storage - retry in 300 seconds
2012-04-02 19:29:23,916 CRITICAL - MainThread - server failure in db transaction - retry in 300 seconds
2012-04-02 19:33:48,326 CRITICAL - MainThread - server failure in db transaction - retry in 10 seconds
2012-04-02 19:33:58,336 CRITICAL - MainThread - server failure in db transaction - retry in 30 seconds
Dropping severity for now. Will monitor through the night.
Assignee: ashish → server-ops
Severity: critical → normal
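For reference, the escalating 10/30/300-second retries in that log correspond to a reconnect-and-retry pattern along these lines. A minimal sketch only (not Socorro's actual processor code), assuming psycopg2 and an illustrative DSN that goes through pgbouncer:

import time
import psycopg2

# Assumed DSN; host, port, dbname and user are placeholders for illustration.
DSN = "host=127.0.0.1 port=6432 dbname=breakpad user=processor"

def run_with_retry(sql, params=None, delays=(10, 30, 300)):
    """Run one row-returning statement, reconnecting after dead-connection errors."""
    for delay in delays + (None,):
        try:
            conn = psycopg2.connect(DSN)
            try:
                with conn, conn.cursor() as cur:  # commit on success, rollback on error
                    cur.execute(sql, params)
                    return cur.fetchall()
            finally:
                conn.close()
        except (psycopg2.InterfaceError, psycopg2.OperationalError) as exc:
            if delay is None:
                raise
            print("server failure in db transaction - retry in %d seconds (%s)" % (delay, exc))
            time.sleep(delay)

If pgbouncer keeps killing server connections, a loop like this only papers over the problem (each retry reconnects), which matches the behaviour seen here: the processors limp along and then lock up again.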
Comment 10•13 years ago
(In reply to Phong Tran [:phong] from comment #2)
> We recently changed the bouncer db password. Could this be related?
What do you mean by this? The password to what?
Comment 11•13 years ago
The processors were quiet overnight the couple of times I checked. No new CRITICALs since #c9.
(In reply to [:jberkus] Josh Berkus from comment #10)
> What do you mean by this? The password to what?
Unrelated change (that was the MySQL bouncer db, not pgbouncer).
Updated•13 years ago
Assignee: server-ops → server-ops-database
Component: Server Operations → Server Operations: Database
QA Contact: phong → cshields
Assignee
Updated•13 years ago
Assignee: server-ops-database → mpressman
Comment 12•13 years ago
Seems to be working now; reopen on recurrence.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
Updated•11 years ago
Product: mozilla.org → Data & BI Services Team