Closed
Bug 659724
Opened 14 years ago
Closed 13 years ago
tinderbox queue runner is not keeping up (dm-webtools02)
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: bear)
References
Details
(Whiteboard: [tinderbox][buildduty])
Attachments
(1 file)
We were just alerted that the tinderbox mail queue is 12 minutes long. This means that the MTA's are doing their job, but that the queue runner can't keep up. We're discussing parallelizing that queue-running somehow; here's the IRC convo: 14:14 < justdave> right now the cron job fires once per minute, and takes an exclusive lock on the queue 14:14 < justdave> it runs until it finishes the entire queue then exits 14:14 < bear-lunch> assign it to me so I can add it to my todo list when I review things with coop 14:14 < dustin> bear-lunch: you got it 14:14 < justdave> if the cron job fires again while it's still running, it sees the lock in place and exits 14:15 * joduinn happy that the nagios check caught it so we could all see the state of the world while it was still happening 14:15 < justdave> if we locked individual queue files instead of the entire queue, each invocation of the cron job could add an additional queue runner 14:15 < armenzg_buildduty> what is the hostname that processes this? 14:15 < justdave> so as it started getting behind, we'd add a new runner every minute until they all caught up 14:15 < bear-lunch> justdave - yea, we could do that with some sane upper limit 14:15 < justdave> yeah, 4 would probably be a good limit 14:15 < dustin> armenzg_buildduty: dm-webtools02 14:15 < justdave> seeing how there's 4 cores on the CPU Over to bear to look into this and triage.
Assignee | ||
Comment 1•14 years ago
|
||
another idea from the same IRC conversation: [15:19] <justdave> bear-lunch: another thought... right now I think it sets a hash internally of what trees have been touched while it's processing, and then once it's finished with the queue, it updates the cache on all of those trees. [15:20] <justdave> we could have it touch some file in each tree that says "this tree's cache is out of date" and have a separate cron job come and rebuild those
Comment 2•14 years ago
|
||
See also explanation on dev.tree-management http://tinyurl.com/3puvc2f
Updated•14 years ago
|
Whiteboard: [tinderbox] → [tinderbox][buildduty]
Comment 3•14 years ago
|
||
[16:04] <nagios-sjc1> [06] dm-webtools02:Tinderbox Mail is WARNING: WARNING: Oldest message is 18 minutes old [16:24] <nagios-sjc1> [12] dm-webtools02:Tinderbox Mail is WARNING: WARNING: Oldest message is 21 minutes old
Comment 4•14 years ago
|
||
Getting bug 661365 security review would obsolete this bug.
Comment 5•14 years ago
|
||
Should we dupe this to bug 657521? This was just to cover the occurrence last time it happened.
Assignee | ||
Comment 6•14 years ago
|
||
no, this is the code change needed to try and avoid the nagios alerts. the other bug is a hardware fix.
Comment 7•14 years ago
|
||
Lag is currently over 2k messages and 40 minutes, and we'll have to see if that subsides once nightlies clear away. Have we heard anything to confirm the security review is proceeding ? Should we be looking at threading (or something) at http://mxr.mozilla.org/mozilla/source/webtools/tinderbox/processbuild.pl#67: 67 for my $file (@datafiles) { 68 &process_mailfile("$::data_dir/$file"); 69 } Beware the global vars at lines 39-43.
Assignee | ||
Comment 8•14 years ago
|
||
The person doing the security review (forget his name) let us know that he would working on the review, so it is moving. The loop you mentioned above is close to what I was going to do after I review the code to make sure none of the called routines (like scrape.pl and some others) don't adjust some of the global data files - they are not async friendly. The only one that really has me worried is limiting it to one process per tree
Comment 9•14 years ago
|
||
So, the lag problem went away when the 'monthly backup' finished, removing a bunch of load from the box & SAN ?
Reporter | ||
Comment 10•14 years ago
|
||
Well, the alerts stopped, but it was also the weekend. We may be in the eye of the storm now.
Comment 11•13 years ago
|
||
Once we switch over to using TBPL, we'll no longer need to send emails, hence no longer concerned about this queue length. Keeping this bug open for now, as it remains a production concern until we switch over to tbpl.
Depends on: 625979
Reporter | ||
Comment 12•13 years ago
|
||
Details from notes from months-ago IRC conversations: Switching to the new TBPL implementation is a step, and at some point makes the box no longer a concern of releng's, but does not allow tinderbox.mozilla.org to be turned off. First, it still provides web views that are useful to some people (aki, at least) and are not available elsewhere. To continue to support these views, releng should continue sending mails to tinderbox, at least for some branches. Second, there are external projects, NSPR among them, which still use tinderbox as their primary build-reporting tool. Of course, those are much lower load than the releng builds, so the queue problems we're seeing in this bug would not apply.
Comment 13•13 years ago
|
||
fyi: this queue is backed up again for the last ~6 hours. Example is: Subject: ** PROBLEM alert - dm-webtools02/Tinderbox Queue is WARNING ** Date: Wed, 3 Aug 2011 00:35:17 -0700 (PDT) From: nagios@dm-nagios01.mozilla.org (nagios) To: release@mozilla.com ***** Nagios ***** Notification Type: PROBLEM Service: Tinderbox Queue Host: dm-webtools02 Address: 10.2.74.14 State: WARNING Date/Time: 08-03-2011 00:35:17 Additional Info: WARNING: 3230 Messages Queued. Oldest is 106 minutes old
Comment 15•13 years ago
|
||
(In reply to comment #13) > fyi: this queue is backed up again for the last ~6 hours. Example is: Does comment 12 suggest that we should be finding resources to rework this code? Having fewer people depending on this without actually reducing load seems like a recipe for disaster.
Comment 16•13 years ago
|
||
We should improve our non-tinderbox web views so we can turn off MoCo email to this box instead, I think.
Comment 17•13 years ago
|
||
Seriously and constantly impacting development - not only is there the constant 20-80 minute (with spikes to 4 or 5 hours) lag processing the mail, there's also a lag of unknown duration between when the mail is processed and when the static json that tbpl reads is updated. The way I spotted it was by watching #firebot, where he reads quickparse.txt and reports when something changes, and noticing that 20 minutes later, the change still wasn't in tbpl. You can work around that lag by forcing static page generation, but the fact that I'm considering setting up a cronjob to load those URLs for the trees I'm interested in every 5 or 10 minutes is rather gross.
Severity: normal → blocker
Depends on: 669000
Comment 18•13 years ago
|
||
2 days ago we had to close all trees, yesterday it was bad (1 hour of backlog) but we were able to survive it, today trees are closed again and I don't expect the situation to improve at least till next merge. If the new tbpl is too far away, may we at least move all tryserver messages to a separate queue? Those are usually the largest pile and there is no valid reason they should impact development on final trees.
Comment 19•13 years ago
|
||
(In reply to Marco Bonardo [:mak] (Away 8-14 Aug) from comment #18) > If the new tbpl is too far away, may we at least move all tryserver messages > to a separate queue? Those are usually the largest pile and there is no > valid reason they should impact development on final trees. Can we turn off try tinderbox mail altogether? Aiui, we already have options for developers to a) have results mailed to them b) look at logs on ftp c) get results posted to bugzilla d) look at results in tbpl staging.
Updated•13 years ago
|
Summary: tinderbox queue runner is not keeping up → tinderbox queue runner is not keeping up (dm-webtools02)
Assignee | ||
Comment 20•13 years ago
|
||
This patch adds some *very* rough debug logs (to help us see what is happening when) and also some good timing and other stats per mail file it's not the prettiest perl code, but my goal was to add and not change the underlying processing so we can insert this into production to start gathering data
Attachment #555322 -
Flags: review?(rhelmer)
Attachment #555322 -
Flags: feedback?(nrthomas)
Attachment #555322 -
Flags: feedback?(aki)
Comment 21•13 years ago
|
||
Comment on attachment 555322 [details] [diff] [review] add debug logging and stats logging to processbuild.pl Nice, we should get useful data from this. There would be some ambiguity about which builds were submitting the most data because of varying use of compression on the mail. We could print the logcompression header too, or calculate the uncompressed size (perhaps by summing up $readbytes in compress_log_file). >-use lib "@TINDERBOX_DIR@"; >+use lib "/var/www/iscsi/webtools/tinderbox"; >-$ENV{'PATH'} = "@SETUID_PATH@"; >-my $tinderboxdir = "@TINDERBOX_DIR@"; >+$ENV{'PATH'} = "/usr/bin:/bin"; >+my $tinderboxdir = "/var/www/iscsi/webtools/tinderbox"; aki, rhelmer - these were local changes already on the prod box.
Attachment #555322 -
Flags: feedback?(nrthomas) → feedback+
Reporter | ||
Comment 22•13 years ago
|
||
Comment on attachment 555322 [details] [diff] [review] add debug logging and stats logging to processbuild.pl Review of attachment 555322 [details] [diff] [review]: ----------------------------------------------------------------- Looks good. Are we at all worried about space for the logs? What is the available space on /tmp? ::: processbuild.pl @@ -32,5 @@ > umask 002; > > # setuid globals > -$ENV{'PATH'} = "@SETUID_PATH@"; > -my $tinderboxdir = "@TINDERBOX_DIR@"; Do you want to commit these changes? Or is this just an artifact of the patching process?
Attachment #555322 -
Flags: feedback+
Assignee | ||
Comment 23•13 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #22) > Comment on attachment 555322 [details] [diff] [review] > add debug logging and stats logging to processbuild.pl > > Review of attachment 555322 [details] [diff] [review]: > ----------------------------------------------------------------- > > Looks good. Are we at all worried about space for the logs? What is the > available space on /tmp? once my tests and pokes are done, i'll move them to a "proper" log directory > > ::: processbuild.pl > @@ -32,5 @@ > > umask 002; > > > > # setuid globals > > -$ENV{'PATH'} = "@SETUID_PATH@"; > > -my $tinderboxdir = "@TINDERBOX_DIR@"; > > Do you want to commit these changes? Or is this just an artifact of the > patching process? that's the diff from CVS (pre-install) to production - so yea, artifacts
Comment 24•13 years ago
|
||
Comment on attachment 555322 [details] [diff] [review] add debug logging and stats logging to processbuild.pl +1 for dustin's comments.
Attachment #555322 -
Flags: feedback?(aki) → feedback+
Comment 25•13 years ago
|
||
Comment on attachment 555322 [details] [diff] [review] add debug logging and stats logging to processbuild.pl >Index: processbuild.pl >=================================================================== > # setuid globals >-$ENV{'PATH'} = "@SETUID_PATH@"; >-my $tinderboxdir = "@TINDERBOX_DIR@"; >+$ENV{'PATH'} = "/usr/bin:/bin"; >+my $tinderboxdir = "/var/www/iscsi/webtools/tinderbox"; Probably don't want to change these ^ Otherwise looks fine, r+ with those reverted (or explanation why they shouldn't be)
Attachment #555322 -
Flags: review?(rhelmer) → review+
Assignee | ||
Comment 26•13 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #25) > Comment on attachment 555322 [details] [diff] [review] > add debug logging and stats logging to processbuild.pl > > >Index: processbuild.pl > >=================================================================== > > # setuid globals > >-$ENV{'PATH'} = "@SETUID_PATH@"; > >-my $tinderboxdir = "@TINDERBOX_DIR@"; > >+$ENV{'PATH'} = "/usr/bin:/bin"; > >+my $tinderboxdir = "/var/www/iscsi/webtools/tinderbox"; > > Probably don't want to change these ^ > :) - yea, I was diff'ing against the code I was running on my test vm > Otherwise looks fine, r+ with those reverted (or explanation why they > shouldn't be) my hope is that I can run this on production only to gather data - then decide if this will go into the codebase
Comment 27•13 years ago
|
||
bug#681836 will help give us more data to figure out where the bottleneck is.
Depends on: 681836
Comment 28•13 years ago
|
||
Related: https://bugzilla.mozilla.org/show_bug.cgi?id=681680
Assignee | ||
Comment 30•13 years ago
|
||
closing this as FIXED - tbpl has been deployed and the number of builders sending tbox email has been drastically reduced. if for some reason this becomes a problem in the future, please shoot me before reopening...
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•