Last Comment Bug 859315 - lots of "Lock wait timeout exceeded" errors when updating cf_crash_signature
: lots of "Lock wait timeout exceeded" errors when updating cf_crash_signature
Status: VERIFIED FIXED
:
Product: bugzilla.mozilla.org
Classification: Other
Component: General (show other bugs)
: Production
: x86 Mac OS X
: -- normal (vote)
: ---
Assigned To: Byron Jones ‹:glob› [PTO until 2017-01-09]
:
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-04-08 07:00 PDT by Byron Jones ‹:glob› [PTO until 2017-01-09]
Modified: 2013-04-23 23:59 PDT (History)
8 users (show)
See Also:
Due Date:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments

Description Byron Jones ‹:glob› [PTO until 2017-01-09] 2013-04-08 07:00:16 PDT
we're seeing lots of "Lock wait timeout exceeded" errors when updating cf_crash_signature in the error logs:

[Mon Apr 08 02:04:30 2013] [error] DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction [for Statement "UPDATE bugs SET cf_crash_signature = ? WHERE bug_id = ?"] at /data/www/bugzilla.mozilla.org/Bugzilla/Object.pm line 439.

other timestamps:
Apr 08 02:10:20
Apr 05 06:01:35
Apr 04 04:55:55
Apr 04 07:22:09
Apr 04 07:25:44
Apr 04 10:04:44
Apr 04 10:33:05
Apr 04 00:20:43
Apr 04 05:49:57
Apr 04 08:34:06
Apr 04 10:34:59
Apr 04 04:59:36
Apr 04 05:26:41
Apr 04 08:25:51
Apr 04 11:45:47
Apr 03 15:29:36
Apr 03 15:32:29
Apr 03 15:36:06
Apr 03 15:48:18
Apr 03 23:37:08
Apr 03 15:26:54
Apr 03 15:28:25
Apr 03 15:53:54
Apr 03 17:05:36
Apr 03 23:37:43
Apr 03 15:41:15
Apr 03 15:27:32
Apr 03 15:52:14
Apr 03 16:09:24
Apr 03 17:48:07
Comment 1 Sheeri Cabral [:sheeri] 2013-04-08 07:07:32 PDT
cf_crash_signature is a mediumtext value, but so are these:
  `bug_file_loc` mediumtext NOT NULL,
  `status_whiteboard` mediumtext NOT NULL,
Comment 2 Sheeri Cabral [:sheeri] 2013-04-08 07:08:38 PDT
None of the three mediumtext fields have any indexes on them.
Comment 3 Sheeri Cabral [:sheeri] 2013-04-08 07:15:01 PDT
Some length information:


MariaDB [bugs]>  select max(length(bug_file_loc)), max(length(status_whiteboard)), max(length(cf_crash_signature)) from bugs\G
*************************** 1. row ***************************
      max(length(bug_file_loc)): 65535
 max(length(status_whiteboard)): 237
max(length(cf_crash_signature)): 17084
1 row in set (8.39 sec)

looks like bug_file_loc takes up all the space we have (and we might be losing data), status_whiteboard is pretty small (probably because it's only a textfield in the GUI, so ppl don't put things that are too long in there, and cf_crash_signature is appropriately sized.
Comment 4 Sheeri Cabral [:sheeri] 2013-04-08 07:29:49 PDT
MariaDB has some extra deadlock variables:


MariaDB [bugs]> show variables like '%deadlock%';
+-----------------------------+----------+
| Variable_name               | Value    |
+-----------------------------+----------+
| deadlock_search_depth_long  | 15       |
| deadlock_search_depth_short | 4        |
| deadlock_timeout_long       | 50000000 |
| deadlock_timeout_short      | 10000    |
+-----------------------------+----------+
4 rows in set (0.00 sec)

The timeouts are in microseconds, so the short timeout is 0.01 seconds, long is 50 seconds.
Comment 5 Sheeri Cabral [:sheeri] 2013-04-08 07:33:57 PDT
According to https://kb.askmonty.org/en/mysqld-options-full-list/:

--deadlock-search-depth-long=# 	Long search depth for the two-step deadlock detection	
--deadlock-search-depth-short=# 	Short search depth for the two-step deadlock detection	
--deadlock-timeout-long=# 	Long timeout for the two-step deadlock detection (in microseconds)	
--deadlock-timeout-short=# 	Short timeout for the two-step deadlock detection (in microseconds)

I have to learn more about this 2-step deadlock detection....but that might be part of it. Maybe we're seeing the deadlock problems in the 1st step and they're not real problems?
Comment 6 Sheeri Cabral [:sheeri] 2013-04-08 07:48:31 PDT
https://kb.askmonty.org/en/server-system-variables/#new-variables-in-mariadb doesn't list these variables, sadly (nowhere on the page).
Comment 7 Sheeri Cabral [:sheeri] 2013-04-08 07:51:04 PDT
sheeri: glob: When has this been happening since? 4/3 or earlier?
[10:45am] sheeri: we swapped bugzilla's master last week.
[10:45am] sheeri: so I want to rule that out if possible.
[10:46am] • glob looks
[10:46am] glob: and, yeah, that looked like fun times for all
[10:46am] glob: did said fun happen on the 3rd?
[10:48am] glob: 3rd, 4th, and 5th are busy (as per the bug).  then i have to go back to march 22nd for the next instance
[10:48am] sheeri: huh
[10:49am] sheeri: I wonder if that has to do with the caching that has to happen because the failover isn't as "hot" with the data?

I know we do want to flip this back...
Comment 8 Sheeri Cabral [:sheeri] 2013-04-08 07:53:12 PDT
bugzilla1 wasn't recently restarted so flipping back shouldn't be as much of a problem. I want to do a risk assessment with the db team in our meeting today about flipping back before we figure out what this is.
Comment 9 Sheeri Cabral [:sheeri] 2013-04-08 07:55:06 PDT
glob: since the move to scl3, just 7 occurrences on mar 22nd.
[10:54am] sheeri: interesting. If it were caching, we'd have seen it just after the switch too.

I'm thinking it might not be related to the flip.
Comment 10 Sheeri Cabral [:sheeri] 2013-04-08 08:10:42 PDT
serg: sheeri: unfortunately, these variables won't help. This "two-step deadlock detection" was added back in MySQL time, as a general purpose all-server deadlock detector, but none of the server subsystem used it in MySQL, and it was removed. In MariaDB only Aria uses it at the moment, so it's of not much use.
[11:10am] sheeri: oh! serg that's great to know, it at least rules this out (That table is innodb).

So not related to this.
Comment 11 Sheeri Cabral [:sheeri] 2013-04-08 08:15:34 PDT
Just for more data:

| innodb_lock_wait_timeout   | 50       |
Comment 12 Sheeri Cabral [:sheeri] 2013-04-08 08:27:25 PDT
Just FYI there's nothing out of place on the master error log.
Comment 13 Sheeri Cabral [:sheeri] 2013-04-09 08:41:48 PDT
Thinking about this more, deadlocks sounded familiar. Is this another instance of https://bugzilla.mozilla.org/show_bug.cgi?id=784350 ?
Comment 14 Sheeri Cabral [:sheeri] 2013-04-10 12:30:04 PDT
Is this still happening or has it trailed off?
Comment 15 Scoobidiver (away) 2013-04-10 12:54:04 PDT
(In reply to Sheeri Cabral [:sheeri] from comment #14)
> Is this still happening or has it trailed off?
The last time it happened, it was on April 9th at 17:46:25 UTC and during a few hours after that. I am going to add a crash signature to bug 848913 that reliably caused that when I have finished my today's work on bugs (it locks the database for me during several hours).
Comment 16 Byron Jones ‹:glob› [PTO until 2017-01-09] 2013-04-10 22:19:51 PDT
it's still happening, but with a much lower frequency.

there's been two occurrences since the 8th:
[Wed Apr 10 14:08:43 2013]
[Tue Apr 09 10:16:19 2013]

which appear to sync with scoobidiver's updates.

scoobidiver: do you update bugs in batches?  eg. open lots of tabs, and update them in quick succession.
Comment 17 Scoobidiver (away) 2013-04-10 23:31:28 PDT
After adding a crash signature to bug 848913, submitting it and waiting more than two minutes while connecting (it's very long), I got again that on April 20 at 21:09 UTC:
DBD::mysql::db do failed: Lock wait timeout exceeded; try restarting transaction [for Statement "UPDATE bugs SET cf_crash_signature = ? WHERE bug_id = ?"]
Then any bugs where I comment or add a signature are locked during a few hours.

(In reply to Byron Jones ‹:glob› from comment #16)
> it's still happening, but with a much lower frequency.
It's the same frequency as before for me because of bug 848913. It happened to me only once without bug 848913 being the cause.

> scoobidiver: do you update bugs in batches?  eg. open lots of tabs, and
> update them in quick succession.
I do that way rarely. There's at least one minute between two submissions especially since I encounter this bug and bug 849687 just before.
Comment 18 Wayne Mery (:wsmwk, NI for questions) 2013-04-13 17:42:09 PDT
seeing this tonight with bug 803460.
3 times

twice with comment and signature change.
then, trying a simple comment
Comment 19 Byron Jones ‹:glob› [PTO until 2017-01-09] 2013-04-18 10:56:07 PDT
this appears to happen only when updating cf_crash_signature with large-ish values.

this error has been reported while updating the following bugs (and probably more bugs):
bug 803460
bug 848913
bug 803460
Comment 20 Sheeri Cabral [:sheeri] 2013-04-18 11:02:18 PDT
Wayne reported it happened again with bug 803460.

I have pt-deadlock-logger running on the bugzilla master db, but the most recent deadlock was 5h ago. :(
Comment 21 Scoobidiver (away) 2013-04-18 11:49:38 PDT
(In reply to Byron Jones ‹:glob› from comment #19)
> this appears to happen only when updating cf_crash_signature with large-ish
> values.
It's true but under certain conditions. Indeed, I updated the crash signature of bug 854082 without any problem.
Those faulty bugs all have alloc in the crash signature.
Comment 22 Sheeri Cabral [:sheeri] 2013-04-18 11:59:50 PDT
So we are currently seeing it with bug 803460. I tried to cc myself, and looked at the SHOW ENGINE INNODB STATUS in the TRANSACTIONS part, and here's my query, stuck waiting for a lock:

---TRANSACTION 5903BCB1, ACTIVE 6 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 376, 1 row lock(s)
MySQL thread id 20071526, OS thread handle 0x7fa3c5a51700, query id 1241799607 10.22.70.209 bugs update
INSERT INTO cc (bug_id, who) VALUES ('803460','430503')
Trx read view will not see trx with id >= 5903BCB2, sees < 58EC827F
------- TRX HAS BEEN WAITING 6 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 309 page no 59509 n bits 88 index `PRIMARY` of table `bugs`.`bugs` trx id 5903BCB1 lock mode S locks rec but not gap waiting
------------------
TABLE LOCK table `bugs`.`cc` trx id 5903BCB1 lock mode IX
TABLE LOCK table `bugs`.`bugs` trx id 5903BCB1 lock mode IS
RECORD LOCKS space id 309 page no 59509 n bits 88 index `PRIMARY` of table `bugs`.`bugs` trx id 5903BCB1 lock mode S locks rec but not gap waiting


So I looked further at what could be locking this and see:
---TRANSACTION 58EC827F, ACTIVE 4306 sec inserting
mysql tables in use 1, locked 1
8 lock struct(s), heap size 1248, 3 row lock(s), undo log entries 2574435
MySQL thread id 20004967, OS thread handle 0x7fa8b9e59700, query id 1241807666 10.22.70.209 bugs update
INSERT INTO bugs_activity
                  (bug_id, who, bug_when, fieldid, removed, added, comment_id)
                  VALUES ('803460', '29811', '2013-04-18 10:32:21', '120', '', '', NULL)
Trx read view will not see trx with id >= 58EC8280, sees < 58EC8280
TABLE LOCK table `bugs`.`bugs` trx id 58EC827F lock mode IX
RECORD LOCKS space id 309 page no 59509 n bits 88 index `PRIMARY` of table `bugs`.`bugs` trx id 58EC827F lock_mode X locks rec but not gap
TABLE LOCK table `bugs`.`longdescs` trx id 58EC827F lock mode IX
TABLE LOCK table `bugs`.`profiles` trx id 58EC827F lock mode IS
RECORD LOCKS space id 191 page no 343 n bits 232 index `PRIMARY` of table `bugs`.`profiles` trx id 58EC827F lock mode S locks rec but not gap
TABLE LOCK table `bugs`.`bugs_activity` trx id 58EC827F lock mode IX
TABLE LOCK table `bugs`.`fielddefs` trx id 58EC827F lock mode IS
RECORD LOCKS space id 262 page no 9 n bits 280 index `PRIMARY` of table `bugs`.`fielddefs` trx id 58EC827F lock mode S locks rec but not gap

Note that it's an insert into bugs_activity. I looked at thread 20004967 and saw that it was sleeping, and so I killed it, and it's currently in the process of rolling back:
| 20004967 | bugs        | 10.22.70.209:17291 | bugs | Killed      |     113 | NULL                                                                        | NULL             |    0.000 |

Once that finishes rolling back, I'll see if I can cc myself on the bug (it should work easily). If that's the case, we have to figure out why that transaction never finished.
Comment 23 Byron Jones ‹:glob› [PTO until 2017-01-09] 2013-04-18 21:49:00 PDT
sheeri was able to cc herself on that bug, and that hung transaction explains why you see timeouts when trying to update other bugs once cf_crash_signature has been updated.

that we're inserting empty 'removed' and 'added' into bugs_activity points to an issue with the code which splits large changes into chunks for insertion into that table -- i suspect it's stuck in an infinite loop and never closes the transaction.

Scoobidiver and/or wsmwk - can you please attach to this bug a value for crash-signature which you've tried to set that resulted in a timeout?
Comment 24 Scoobidiver (away) 2013-04-18 23:54:20 PDT
(In reply to Byron Jones ‹:glob› from comment #23)
> Scoobidiver and/or wsmwk - can you please attach to this bug a value for
> crash-signature which you've tried to set that resulted in a timeout?
Adding [@ mozalloc_abort(char const* const) | NS_DebugBreak] to bug 848913 reproduces the issue for me.
Comment 25 Wayne Mery (:wsmwk, NI for questions) 2013-04-19 02:55:11 PDT
adding [@ XPCNativeSet::Mark()]  to bug 803460
Comment 26 Byron Jones ‹:glob› [PTO until 2017-01-09] 2013-04-22 01:18:40 PDT
the problem was the code which splits the changed value into chunks small enough to insert into the bugs_activity table was hitting an infinite loop, resulting in the hanging transaction.  this only happened when a comma was at a particular location into the string.

i had already fixed this issue upstream (bug 772359), but failed to backport the complete fix to bmo-4.2 :(

Committing to: bzr+ssh://bjones%40mozilla.com@bzr.mozilla.org/bmo/4.2/
modified Bugzilla/Bug.pm
modified Bugzilla/Util.pm
Committed revision 8745.
Comment 27 Scoobidiver (away) 2013-04-22 01:43:08 PDT
Is it pushed to prod? I can still reproduce the issue with bug 848913.
Comment 28 Reed Loden [:reed] (use needinfo?) 2013-04-22 10:28:07 PDT
Push to prod only happens once a week. Check back after Wednesday evening. :)
Comment 29 Byron Jones ‹:glob› [PTO until 2017-01-09] 2013-04-23 22:44:45 PDT
this has been pushed to production.
Comment 30 Scoobidiver (away) 2013-04-23 23:59:29 PDT
(In reply to Byron Jones ‹:glob› from comment #29)
> this has been pushed to production.
Verified with bug 848913. Thanks.

Note You need to log in before you can comment on or make changes to this bug.