Lots of alerts like

<nagios-releng> Wed 14:54:57 PDT   buildbot-master77.bb.releng.use1.mozilla.com:Command Queue is CRITICAL: 2 dead items recently

which turn out to be timeouts after 2 minutes trying to add large changes to the statusdb, e.g.

2017-05-10 14:00:14,579 - INSERT INTO sourcestamps (branch, revision, patch_id) VALUES (%s, %s, %s)
2017-05-10 14:00:14,579 - ('integration/autoland', 'eb62dc9d8524742ec288004e12d6380f1535c031', None)
2017-05-10 14:00:14,655 - INSERT INTO changes (number, branch, revision, who, comments, `when`) VALUES (%s, %s, %s, %s, %s, %s)
2017-05-10 14:00:14,655 - (9227481L, 'integration/autoland', 'eb62dc9d8524742ec288004e12d6380f1535c031', 'firstname.lastname@example.org', 'Merge mozilla-central to autoland', datetime.datetime(2017, 5, 10, 13, 36, 40))
2017-05-10 14:00:14,978 - INSERT INTO file_changes (file_id, change_id) VALUES (%s, %s)
2017-05-10 14:00:14,978 - ((3840659L, 6514173L), (979845L, 6514173L), (712747L, 6514173L), (1945983L, 6514173L), (1946189L, 6514173L)...

A workaround is to increase the -m argument in /etc/init.d/command_runner (from 60 to 600), so the insert has time to finish once; subsequent jobs for the same change are then quick. We could also check whether anything uses the file lists on changes; if not, we can stop inserting them.
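To illustrate why a longer limit avoids the dead items, here is a minimal sketch of a runner that gives each job a fixed time budget. It is not the actual command_runner code: the function name run_with_limit, the "dead" label, and the sleeping child process standing in for the slow statusdb insert are all hypothetical, and it assumes -m acts as a per-job limit in seconds.

```python
import subprocess
import sys


def run_with_limit(cmd, max_time):
    """Run cmd and report 'done', 'failed', or 'dead' if it exceeds max_time seconds.

    Hypothetical stand-in for a queue runner with a per-job time limit.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        result = subprocess.run(cmd, timeout=max_time)
    except subprocess.TimeoutExpired:
        return "dead"
    return "done" if result.returncode == 0 else "failed"


# Stand-in for a slow job: a child process that sleeps for one second.
slow_job = [sys.executable, "-c", "import time; time.sleep(1)"]

if __name__ == "__main__":
    print(run_with_limit(slow_job, max_time=0.2))  # limit shorter than the job
    print(run_with_limit(slow_job, max_time=10))   # limit longer than the job
```

With the short limit the job is killed and counted as dead even though it would have succeeded; with the longer limit the same job completes.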
Created attachment 8866544: [puppet] Increase timeout to 10 minutes
Workaround/stopgap solution to avoid manual work recovering dead queues.
Attachment #8866544 - Flags: review?(catlee)
https://hg.mozilla.org/build/puppet/rev/b508b69c55a82f2d10590597797188cfe558b885
Bug 1363890: Increase command queue timeout to 10 minutes. r=catlee
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
Component: General Automation → General
Product: Release Engineering → Release Engineering