Closed Bug 741628 Opened 13 years ago Closed 13 years ago

Processors locking up, running out of db connections

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: lonnen, Assigned: mpressman)

Details

Processors have been locking up throughout the morning. Digging into the logs shows a couple of different errors:
"CRITICAL - MainThread - server failure in db transaction - retry in 300 seconds"
"waiting for retry after failure in crash storage"
http://pastebin.mozilla.org/1551047
Restarting processors offers only temporary relief. Jberkus dug into the pgbouncer logs and found thousands of the following:
"2012-04-02 15:08:03.593 12057 LOG C-0x1fde410: (nodb)/(nouser)@10.8.70.200:47434 closing because: client unexpected eof (age=0)"
Restarting pgbouncer and then manually resetting each processor has not solved anything. The eof errors keep coming in. It looks like 20:00 UTC was the start of the problems.
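To check whether the eof closes cluster around a particular minute or client host, lines like the pgbouncer one quoted above could be tallied with a small script. This is only a sketch: the regular expression is inferred from that single quoted log line, and the names `CLOSE_RE` and `tally_closes` are made up for illustration, not part of pgbouncer or Socorro.

```python
import re
from collections import Counter

# Pattern inferred from the one pgbouncer log line quoted in this bug;
# other pgbouncer versions or log settings may format lines differently.
CLOSE_RE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d+ \d+ LOG "
    r"C-0x[0-9a-f]+: (?P<db>[^/]+)/(?P<user>[^@]+)@(?P<host>[\d.]+):\d+ "
    r"closing because: (?P<reason>.*?) \(age=\d+\)$"
)


def tally_closes(lines):
    """Count client-close events per (minute, client host, reason)."""
    counts = Counter()
    for line in lines:
        match = CLOSE_RE.match(line.strip())
        if match:
            key = (match.group("minute"), match.group("host"), match.group("reason"))
            counts[key] += 1
    return counts


# Example with the line from this bug; in practice the input would be
# the lines of the pgbouncer log file.
sample = [
    "2012-04-02 15:08:03.593 12057 LOG C-0x1fde410: "
    "(nodb)/(nouser)@10.8.70.200:47434 closing because: "
    "client unexpected eof (age=0)"
]
for key, count in tally_closes(sample).most_common(5):
    print(count, key)
```

Grouping by minute rather than by second keeps the output short enough to eyeball when there are thousands of matching lines.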
I'm going to kick this over to IT for help. IT guys: if you have any suggestions - network problems? pgbouncer problems? then great. The timing doesn't suggest a code problem, but if you don't have anything you can kick it back to us.
Assignee: nobody → server-ops
Severity: normal → critical
Component: Infra → Server Operations
Product: Socorro → mozilla.org
QA Contact: infra → phong
Version: unspecified → other
We recently changed the bouncer db password. Could this be related?
Shouldn't be, or it would be constant, not intermittent. mpressman, you just started bug 731011, right? So that shouldn't be it either.
laura, I just finished, and there are an awful lot of connections from the processors after restarting them
The number of connections from the processors has now gone back down to a more normal amount.
Unfortunately, the pgbouncer logs don't go back that far, but having just witnessed what I believe to be the same activity, the logs show the spurious "got packet 'E' from server when not linked" message along with the following output:
2012-04-02 19:25:38.602 14103 WARNING C-0x1299f30: breakpad/processor@10.8.70.200:34035 Pooler Error: server conn crashed?
2012-04-02 19:25:39.399 14103 WARNING C-0x1288840: breakpad/processor@10.8.70.200:9175 Pooler Error: no working server connection
Is that error from Postgres? I don't think we've ever seen that one before - Lars may know more, but I think he's stranded in a flood. ccing rhelmer as well, in case he has any ideas.
I believe that the error to which :mpressman refers is from pgbouncer not Postgres.
Assignee: server-ops → ashish
Looking through the processor logs on a few hosts, the last time they errored out connecting to the db was at 2012-04-02 19:32, which corroborates #c4. They seem to be running fine since then. (grep CRITICAL /var/log/socorro/socorro-processor.log on sp-processor08.phx1.mozilla.com):
2012-04-02 19:29:19,422 CRITICAL - Thread-5 - connection already closed
2012-04-02 19:29:19,422 CRITICAL - Thread-5 - trace back follows:
2012-04-02 19:29:19,423 CRITICAL - Thread-5 - Traceback (most recent call last):
2012-04-02 19:29:19,424 CRITICAL - Thread-5 - File "/data/socorro/application/socorro/processor/processor.py", line 491, in processJob
2012-04-02 19:29:19,424 CRITICAL - Thread-5 - File "/data/socorro/application/socorro/database/database.py", line 195, in connectionCursorPair
2012-04-02 19:29:19,425 CRITICAL - Thread-5 - InterfaceError: connection already closed
2012-04-02 19:29:19,425 CRITICAL - Thread-5 - major failure in crash storage - retry in 300 seconds
2012-04-02 19:29:23,916 CRITICAL - MainThread - server failure in db transaction - retry in 300 seconds
2012-04-02 19:33:48,326 CRITICAL - MainThread - server failure in db transaction - retry in 10 seconds
2012-04-02 19:33:58,336 CRITICAL - MainThread - server failure in db transaction - retry in 30 seconds
Dropping severity for now. Will monitor through the night.
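For repeating the "when did this host last log a CRITICAL" check across several processors, the grep above could be wrapped in a short script. This is a rough sketch that assumes every processor log uses the same line layout as the excerpt from sp-processor08; `CRITICAL_RE` and `last_critical` are hypothetical names, not Socorro code.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above:
# "<date> <time>,<millis> CRITICAL - <thread> - <message>"
CRITICAL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"CRITICAL - (?P<thread>\S+) - (?P<msg>.*)$"
)


def last_critical(lines):
    """Return (timestamp, thread, message) of the latest CRITICAL line, or None."""
    latest = None
    for line in lines:
        match = CRITICAL_RE.match(line.rstrip("\n"))
        if not match:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        if latest is None or ts >= latest[0]:
            latest = (ts, match.group("thread"), match.group("msg"))
    return latest


# Example using two of the lines quoted in this comment.
sample = [
    "2012-04-02 19:29:19,422 CRITICAL - Thread-5 - connection already closed",
    "2012-04-02 19:33:58,336 CRITICAL - MainThread - server failure in db transaction - retry in 30 seconds",
]
print(last_critical(sample))
```

Running it over each host's log would give one timestamp per processor to compare against the pgbouncer warnings.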
Assignee: ashish → server-ops
Severity: critical → normal
(In reply to Phong Tran [:phong] from comment #2)
> We recently changed the bouncer db password. Could this be related?
What do you mean by this? The password to what?
The processors were quiet overnight the couple of times I checked. No new CRITICALs since #c9.
(In reply to [:jberkus] Josh Berkus from comment #10)
> What do you mean by this? The password to what?
Unrelated change (the MySQL bouncer db vs. pgbouncer).
Assignee: server-ops → server-ops-database
Component: Server Operations → Server Operations: Database
QA Contact: phong → cshields
Assignee: server-ops-database → mpressman
Seems to be working now; reopen on recurrence.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Data & BI Services Team