Closed Bug 638010 Opened 14 years ago Closed 9 years ago

"deadlock detected" in PHX monitor/processors

Categories

(Socorro :: General, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: rhelmer, Unassigned)

Monitor and all processors went down around 2011-03-01 21:15, with error messages such as this from processor09:
"""
2011-03-01 21:05:08,456 CRITICAL - Thread-7 - something's gone horribly wrong with the database connection
2011-03-01 21:05:08,457 CRITICAL - Thread-7 - Caught Error: <class 'psycopg2.extensions.TransactionRollbackError'>
2011-03-01 21:05:08,458 CRITICAL - Thread-7 - deadlock detected
DETAIL: Process 20775 waits for RowExclusiveLock on relation 133535 of database 48819; blocked by process 1817.
Process 1817 waits for AccessExclusiveLock on relation 49320 of database 48819; blocked by process 20775.
HINT: See server log for query details.
"""
puppet picked things back up as bkero and I were looking into it, and everything seems ok as of 2011-03-01 21:25.
We've seen this once before. See bug 575760. Did it occur when the create_partitions cron was running?
(In reply to comment #1)
> We've seen this once before. See bug 575760. Did it occur when the
> create_partitions cron was running?
Yes:
05 21 * * 2 socorro /data/socorro/application/scripts/crons/cron_create_partitions.sh
From /var/log/socorro/cron_create_partitions.log:
started 2011-03-01 21:05:02
completed 2011-03-01 21:05:08
We need to look ahead and see whether the partitions for the next four weeks exist and are owned by the correct user. If they don't exist, create them manually. If they don't have the same owner as the rest of the partitions, that should get corrected, too. Like last time, this problem is a rarity. When we get around to refactoring the SQL code, we'll rework the partition creation code with an eye toward preventing this problem.
(In reply to comment #3)
> we need to look ahead and see if the partitions for the next four weeks exist
> and owned by the correct user. If they don't exist, create them manually. If
> they don't have the same owner as the rest of the partitions, that should get
> corrected, too.
Looks ok to me:
breakpad=> \dt *_201103*
                 List of relations
 Schema |           Name           | Type  |    Owner
--------+--------------------------+-------+-------------
 public | extensions_20110307      | table | breakpad_rw
 public | extensions_20110314      | table | breakpad_rw
 public | extensions_20110321      | table | breakpad_rw
 public | extensions_20110328      | table | breakpad_rw
 public | frames_20110307          | table | breakpad_rw
 public | frames_20110314          | table | breakpad_rw
 public | frames_20110321          | table | breakpad_rw
 public | frames_20110328          | table | breakpad_rw
 public | plugins_reports_20110307 | table | breakpad_rw
 public | plugins_reports_20110314 | table | breakpad_rw
 public | plugins_reports_20110321 | table | breakpad_rw
 public | plugins_reports_20110328 | table | breakpad_rw
 public | reports_20110307         | table | breakpad_rw
 public | reports_20110314         | table | breakpad_rw
 public | reports_20110321         | table | breakpad_rw
 public | reports_20110328         | table | breakpad_rw
(16 rows)
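The manual look-ahead check above could be scripted. What follows is only a rough sketch, not Socorro code: the function names are made up for illustration, and it assumes (inferring from the names in the listing) that partitions are named `<table>_<YYYYMMDD>` after the Monday that starts each week. The `pg_tables` view and its `tablename`/`tableowner` columns are standard PostgreSQL; how the query result gets turned into the `owners` dict is left to the caller.

```python
from datetime import date, timedelta

# Base table names taken from the listing in this bug.
BASE_TABLES = ("extensions", "frames", "plugins_reports", "reports")

# Query a caller could run to build the name -> owner mapping.
OWNERS_SQL = "SELECT tablename, tableowner FROM pg_tables WHERE schemaname = 'public'"


def expected_partitions(today, weeks=4, bases=BASE_TABLES):
    """Return the partition names expected for the next `weeks` weeks.

    Assumption: each partition is named after the Monday starting its week.
    """
    # Days until the next Monday strictly after `today` (Monday is weekday 0).
    next_monday = today + timedelta(days=7 - today.weekday())
    names = []
    for base in bases:
        for week in range(weeks):
            monday = next_monday + timedelta(weeks=week)
            names.append("%s_%s" % (base, monday.strftime("%Y%m%d")))
    return names


def check_partitions(expected, owners, wanted_owner):
    """Split `expected` into names that are missing and names with the wrong owner.

    `owners` maps table name to owner, e.g. built from OWNERS_SQL rows.
    """
    missing = [name for name in expected if name not in owners]
    wrong_owner = [
        name for name in expected
        if name in owners and owners[name] != wanted_owner
    ]
    return missing, wrong_owner
```

Run for 2011-03-01, `expected_partitions` yields the same 16 names as the `\dt` output, so both lists returned by `check_partitions` would be empty for that listing.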
That looks good to me too. I suggest that, rather than trying to chase the cause of the deadlock, we defer the solution until we refactor the SQL in the system.
Lars, Rob, Creating a new partition takes a lock on the reports table (etc.), which blocks read queries as well as writes. Hence the deadlock. The only real way to avoid this is to take an explicit exclusive lock on the partitioned tables, with NOWAIT in a retry loop. That'll still block, but won't create deadlocks.
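A minimal sketch of the NOWAIT-in-a-retry-loop idea, assuming a DB-API connection such as one from psycopg2. The function name and parameters are hypothetical, and ACCESS EXCLUSIVE is an assumed choice of lock mode; nothing here is taken from the Socorro code. It relies on PostgreSQL reporting a failed `LOCK ... NOWAIT` with SQLSTATE `55P03` (lock_not_available), which psycopg2 exposes as the exception's `pgcode` attribute.

```python
import time

# PostgreSQL SQLSTATE for lock_not_available, raised when NOWAIT cannot lock.
LOCK_NOT_AVAILABLE = "55P03"


def run_with_table_locks(conn, tables, ddl, retries=10, delay=0.5, sleep=time.sleep):
    """Lock `tables` with NOWAIT, run `ddl`, and commit; retry if a lock is busy.

    Returns the number of the attempt that succeeded. Table names are
    interpolated directly into SQL, so they must come from trusted config.
    """
    for attempt in range(1, retries + 1):
        cur = conn.cursor()
        try:
            for table in tables:
                # NOWAIT fails immediately instead of queueing behind other
                # lock holders, so this session never becomes part of a wait cycle.
                cur.execute("LOCK TABLE %s IN ACCESS EXCLUSIVE MODE NOWAIT" % table)
            cur.execute(ddl)
            conn.commit()
            return attempt
        except Exception as exc:
            # Rolling back releases any locks already taken in this attempt.
            conn.rollback()
            busy = getattr(exc, "pgcode", None) == LOCK_NOT_AVAILABLE
            if not busy or attempt == retries:
                raise
            sleep(delay)
        finally:
            cur.close()
```

Because every lock is requested up front and released on failure, the partition job either holds all the locks it needs or none of them. Other sessions still wait while the DDL runs, as the comment above notes.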
Component: Socorro → General
Product: Webtools → Socorro
Too old.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID