741628 - Processors locking up, running out of db connections

Reporter

Description

•

14 years ago

Processors have been locking up throughout the morning. Digging into the logs shows a couple of different errors:

"CRITICAL - MainThread - server failure in db transaction - retry in 300 seconds"
"waiting for retry after failure in crash storage"

http://pastebin.mozilla.org/1551047

Restarting processors offers only temporary relief. Jberkus dug into the pgbouncer logs and found thousands of the following:

"2012-04-02 15:08:03.593 12057 LOG C-0x1fde410: (nodb)/(nouser)@10.8.70.200:47434 closing because: client unexpected eof (age=0)"

Restarting pgbouncer and then manually resetting each processor has not solved anything. The eof errors keep coming in. It looks like 20:00 UTC was the start of the problems.
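
For anyone who wants to quantify the "thousands" of disconnects, here is a rough sketch that tallies pgbouncer "closing because" lines by client host and reason. The regular expression is based only on the single log line quoted above, so other pgbouncer versions or log formats may need adjustments; the function name `tally_close_reasons` is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the one log line quoted in this bug (an assumption,
# not an official pgbouncer log grammar).
CLOSE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ \d+ LOG "
    r"C-0x[0-9a-f]+: (?P<db>[^/]+)/(?P<user>[^@]+)@(?P<host>[\d.]+):\d+ "
    r"closing because: (?P<reason>.+?) \(age=\d+\)$"
)


def tally_close_reasons(lines):
    """Count client disconnects per (client host, close reason)."""
    counts = Counter()
    for line in lines:
        match = CLOSE_RE.match(line.strip())
        if match:
            counts[(match.group("host"), match.group("reason"))] += 1
    return counts


sample = [
    "2012-04-02 15:08:03.593 12057 LOG C-0x1fde410: "
    "(nodb)/(nouser)@10.8.70.200:47434 closing because: "
    "client unexpected eof (age=0)",
    "this line does not match and is skipped",
]

for (host, reason), count in tally_close_reasons(sample).items():
    print(host, reason, count)
```

Against a real log, the `sample` list would be replaced with an open file handle. A tally dominated by one host would point at a single misbehaving client rather than at pgbouncer itself.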

Laura Thomson :laura

Comment 1

•

14 years ago

I'm going to kick this over to IT for help. IT guys: if you have any suggestions - network problems? pgbouncer problems? then great. The timing doesn't suggest a code problem, but if you don't have anything you can kick it back to us.

Assignee: nobody → server-ops

Severity: normal → critical

Component: Infra → Server Operations

Product: Socorro → mozilla.org

QA Contact: infra → phong

Version: unspecified → other

Phong Tran [:phong]

Comment 2

•

14 years ago

We recently changed the bouncer db password. Could this be related?

Laura Thomson :laura

Comment 3

•

14 years ago

Shouldn't be, or it would be constant, not intermittent. mpressman, you just started bug 731011, right? So that shouldn't be it either.

Matt Pressman [:mpressman]

Assignee

Comment 4

•

14 years ago

laura, I just finished, and there are an awful lot of connections from the processors after restarting them

Matt Pressman [:mpressman]

Assignee

Comment 5

•

14 years ago

The number of connections from the processors has now gone back down to a more normal amount.

Matt Pressman [:mpressman]

Assignee

Comment 6

•

14 years ago

Unfortunately, the pgbouncer logs don't go back that far, but having just witnessed what I believe to be the same activity, the logs show the spurious "got packet 'E' from server when not linked" message, along with the output:

2012-04-02 19:25:38.602 14103 WARNING C-0x1299f30: breakpad/processor@10.8.70.200:34035 Pooler Error: server conn crashed?
2012-04-02 19:25:39.399 14103 WARNING C-0x1288840: breakpad/processor@10.8.70.200:9175 Pooler Error: no working server connection

Laura Thomson :laura

Comment 7

•

14 years ago

Is that error from Postgres? I don't think we've ever seen that one before - Lars may know more, but I think he's stranded in a flood. ccing rhelmer as well, in case he has any ideas.

K Lars Lohn [:lars] [:klohn]

Comment 8

•

14 years ago

I believe that the error to which :mpressman refers is from pgbouncer not Postgres.

Ashish Vijayaram [:ashish]

Updated

•

14 years ago

Assignee: server-ops → ashish

Ashish Vijayaram [:ashish]

Comment 9

•

14 years ago

Looking through the processor logs on a few hosts, the last time they errored out connecting to the db was at 2012-04-02 19:32, which corroborates #c4. They seem to be running fine since then.

(grep CRITICAL /var/log/socorro/socorro-processor.log on sp-processor08.phx1.mozilla.com):

2012-04-02 19:29:19,422 CRITICAL - Thread-5 - connection already closed
2012-04-02 19:29:19,422 CRITICAL - Thread-5 - trace back follows:
2012-04-02 19:29:19,423 CRITICAL - Thread-5 - Traceback (most recent call last):
2012-04-02 19:29:19,424 CRITICAL - Thread-5 - File "/data/socorro/application/socorro/processor/processor.py", line 491, in processJob
2012-04-02 19:29:19,424 CRITICAL - Thread-5 - File "/data/socorro/application/socorro/database/database.py", line 195, in connectionCursorPair
2012-04-02 19:29:19,425 CRITICAL - Thread-5 - InterfaceError: connection already closed
2012-04-02 19:29:19,425 CRITICAL - Thread-5 - major failure in crash storage - retry in 300 seconds
2012-04-02 19:29:23,916 CRITICAL - MainThread - server failure in db transaction - retry in 300 seconds
2012-04-02 19:33:48,326 CRITICAL - MainThread - server failure in db transaction - retry in 10 seconds
2012-04-02 19:33:58,336 CRITICAL - MainThread - server failure in db transaction - retry in 30 seconds

Dropping severity for now. Will monitor through the night.
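
To repeat this check across several hosts without eyeballing grep output, something like the following could report the most recent CRITICAL entry in a processor log. This is only a sketch: it assumes the "timestamp CRITICAL - thread - message" layout seen in the excerpt above, and `last_critical` is a hypothetical helper, not part of Socorro.

```python
from datetime import datetime


def last_critical(lines):
    """Return (timestamp, rest of line) for the latest CRITICAL entry, or None."""
    last = None
    for line in lines:
        parts = line.split(" CRITICAL - ", 1)
        if len(parts) != 2:
            continue  # not a CRITICAL line
        try:
            # Timestamps in the excerpt look like "2012-04-02 19:29:19,422".
            ts = datetime.strptime(parts[0].strip(), "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue  # line does not start with the expected timestamp
        if last is None or ts > last[0]:
            last = (ts, parts[1].strip())
    return last


sample = [
    "2012-04-02 19:29:19,422 CRITICAL - Thread-5 - connection already closed",
    "2012-04-02 19:33:48,326 CRITICAL - MainThread - server failure in db transaction - retry in 10 seconds",
    "2012-04-02 19:33:58,336 CRITICAL - MainThread - server failure in db transaction - retry in 30 seconds",
]

print(last_critical(sample))
```

Running it over the log file on each processor host gives one line per host to compare against the time the processors were restarted.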

Assignee: ashish → server-ops

Severity: critical → normal

[:jberkus] Josh Berkus

Comment 10

•

14 years ago

(In reply to Phong Tran [:phong] from comment #2)
> We recently changed the bouncer db password. Could this be related?

What do you mean by this? The password to what?

Ashish Vijayaram [:ashish]

Comment 11

•

14 years ago

The processors were quiet overnight the couple of times I checked. No new CRITICALs since #c9.

(In reply to [:jberkus] Josh Berkus from comment #10)
> What do you mean by this? The password to what?

Unrelated change (mysql bouncer db vs. pgbouncer).

Phong Tran [:phong]

Updated

•

14 years ago

Assignee: server-ops → server-ops-database

Component: Server Operations → Server Operations: Database

QA Contact: phong → cshields

Matt Pressman [:mpressman]

Assignee

Updated

•

14 years ago

Assignee: server-ops-database → mpressman

Laura Thomson :laura

Comment 12

•

14 years ago

Seems to be working now; reopen on recurrence.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → WORKSFORME

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → Data & BI Services Team

Categories

(Data & BI Services Team :: DB: MySQL, task)

Tracking

(Not tracked)

People

(Reporter: lonnen, Assigned: mpressman)

Security

(public)
