Closed Bug 653969 Opened 14 years ago Closed 14 years ago

Tinderbox showing long delays in mail processing

Categories

(Release Engineering :: General, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

(Whiteboard: [buildduty])

Spinning off from bug 652812 because I'm not entirely sure this is the same thing. It's hard to really tell what might be happening from just looking at what shows up on the web pages, but tinderbox.mozilla.org is acting like it's either having trouble processing its mail, or is getting delays receiving it. The two symptoms are builds/tests which tinderbox believes took two or three hours to complete, which really only took 15-30 minutes, and builds/tests which it shows as never having completed, when in at least once case for a short time I saw it showing the test as having completed, and then it backed up to saying that it was running, which would make sense if it received/processed the "build finished" email, then later received/processed the "build started" email. The net result is that for every push, we're adding two or three hours onto the time before the pusher is off the hook, plus we're just throwing the results of some tests on the floor, unless someone does an end run around tinderbox and reads the full logs uploaded to ftp.m.o or uses releng's self-serve api as a weird sort of waterfall to find out old results rather than to create new ones.
I've looked at the buildbot master log for the example in bug 652812 comment #17. AFAICT there are no errors reported for sending the end of build email to tinderbox.m.o, and it happened promptly. The message as about 1.1MB in size. Tinderbox log: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1304198150.1304207827.17628.gz Actual job: 'Linux mozilla-central leak test build' on rev deb13b9167ad Build slave: linux-ix-slave30 on buildbot-master5 According to the buildbot side, we had start time: Sat Apr 30 14:15:50 2011 end time: Sat Apr 30 14:53:16 2011 elapsed : 37 mins, 25 secs Tinderbox has: Started 2011/04/30 14:15, finished 2011/04/30 17:09 2 hours, 53 minutes elapsed buildbot-master5 is in SCL, but I'm also seeing delays for SJC based masters. We're sending to tinderbox-daemon@m.o via a relay of mail.build.mozilla.org (CNAME for smtp.m.o). Please check on the health of those mail servers and the dm-webtools box which is processing tinder-mail.
Assignee: server-ops → phong
justdave: can you look into this more? there was a problem with webtools02 yesterday around this time.
Assignee: phong → justdave
Do we have a nagios checks on the length of the mail queue that tinderbox has yet to process ?
mail spool files for incoming mail are all at /var/www/iscsi/webtools/tinderbox/data and all start with "tbx.". An appropriate nagios check would look for the oldest tbx.* file in that directory and alert if the oldest file is over 5 minutes old. The cron job that processes the queue runs once per minute. I'm not aware that we have such a check currently. There's nothing over a minute old in there currently...
(In reply to comment #2) > there was a problem with webtools02 yesterday around this time. This would be a likely cause. Mail would have been spooled on smtp.mozilla.org while webtools02 was down, and there would have been a delay on re-attempting it after webtools02 came back.
Seen on our masters: 2011-05-02 06:02:56-0700 [-] Unhandled Error Traceback (most recent call last): Failure: twisted.internet.error.DNSLookupError: DNS lookup failed: address 'mail.build.mozilla.org' not found: [Errno -2] Name or service not known.
Armen: was that just briefly this morning around 6:00PDT for the masters in scl1 or is this ongoing? We rebooted the ns servers in scl1 at that time. If you could provide some hostnames along with the logs, that would be very helpful.
There was a problem with one of the nameservers in scl1 this morning (bad route and netmask) that's just been corrected.
As I mentioned on IRC I thought the exception was related to the tinderbox mail issue rather than the reboot scheduled this morning. See bug 653769 for the scheduled reboot and bug 654139 for the fix.
armen: can you verify that these errors have gone away now?
(In reply to comment #10) > armen: can you verify that these errors have gone away now? I can confirm that we (releng) are not getting anymore emails from our masters for the twisted.internet.error.DNSLookupError exception. I can't say anything about the tinderbox mail processing delay.
Still seeing some issues here: The following exceptions (total 2) were detected on buildbot-master04.build.scl1.mozilla.com bm04-tests1: Exception in /builds/buildbot/tests1/master/twistd.log: 2011-05-02 11:53:36-0700 [ESMTPSender,client] Unhandled Error Traceback (most recent call last): Failure: twisted.mail.smtp.SMTPDeliveryError: 421 No recipients accepted tinderbox-daemon@tinderbox.mozilla.org: 421 4.4.2 dm-mail03.mozilla.org Error: timeout exceeded >>> MAIL FROM:<talos.buildbot@build.mozilla.org> <<< 250 2.1.0 Ok >>> RCPT TO:<tinderbox-daemon@tinderbox.mozilla.org> <<< 421 4.4.2 dm-mail03.mozilla.org Error: timeout exceeded -------------------------------------------------------------------------------- Exception in /builds/buildbot/tests1/master/twistd.log: 2011-05-02 11:53:36-0700 [ESMTPSender,client] Unhandled Error Traceback (most recent call last): Failure: twisted.mail.smtp.SMTPDeliveryError: 421 4.4.2 dm-mail03.mozilla.org Error: timeout exceeded >>> RCPT TO:<tinderbox-daemon@tinderbox.mozilla.org> <<< 250 2.1.5 Ok >>> DATA <<< 421 4.4.2 dm-mail03.mozilla.org Error: timeout exceeded
(In reply to comment #12) > Still seeing some issues here: Dustin filed this as bug 654202 in case the issues aren't related.
See Also: → 654202
Whiteboard: [buildduty]
(In reply to comment #4) > I'm not aware that we have such a check currently. There's nothing over a > minute old in there currently... I think some sort of check would be great to have. Preferably one that can alert like this CRITICAL - 12345 tinderbox mails unprocessed, oldest is 123 minutes old rather than '7 mailboxes unprocessed'. How hard would the former be ? We keep wanting to kill tinderbox so it's not worth a lot of effort.
According to bug 655314, tinderbox was lagging at reading its mail at 11:00 today.
(In reply to comment #14) > (In reply to comment #4) > > I'm not aware that we have such a check currently. There's nothing over a > > minute old in there currently... > > I think some sort of check would be great to have. Preferably one that can > alert like this > CRITICAL - 12345 tinderbox mails unprocessed, oldest is 123 minutes old > rather than '7 mailboxes unprocessed'. How hard would the former be ? We > keep wanting to kill tinderbox so it's not worth a lot of effort. This is getting a pressing issue as developers are being delayed in their ability to do their work as seen in bug 652812 comment 24. I am trying to figure out if we can solve the problem in our side by changing how we report things but I would like to know the following: * what is the ETA to add such check? is it possible? * what can we do to add more horse power to tinder-mail processing? * what do you think we can do on our side to help circumvent the problem?
OK, I've split this bug into three parts, based on comment 17. Bug 657519 adds a nagios check for the tinderbox mail queue; Bug 657521 adds horsepower to tinderbox's mail processing; and This bug is for the release-engineering side of things, if any.
Assignee: justdave → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
I just looked at the mail handling side of tinderbox with Amy and I don't think tinderbox is having any issues handling mail. The way tinderbox handles mail would not cause symptoms like are being described. I say we should close/morph this bug and look into how tbpl gets the build-has-ended information and find out what generates that data.
TBPL gets its data on build completion from tinderbox; any issues should also be visible in the tinderbox UI. That said, from comments above it looks like this has happened a bunch of times in the last few weeks but isn't necessarily happening right now.
Yep, we are going to have to search the entire tbox flow when it next happens to find the root cause. My search above was to see if we had any systemic issue that would cause something on the tbox server itself to rise up and cause grief but i'm not seeing that. Next time this happens please poke me directly so I can start searching.
With the check at least we will know when it is happening. philor explains that tbpl notices that a build is done when builds-running.js indicates so rather than when tinderbox tells you that it is done processing the log. tbpl can know that the build finished but not if it finished green/red/orange or what. See bug 652812 comment 17. Also see comment 25 to see further explained that builds-running.js and tinderbox are disconnected from each other. (In reply to comment #20) > TBPL gets its data on build completion from tinderbox; any issues should > also be visible in the tinderbox UI. > From what philor explains I understand that on tbpl the job disappears (because builds-running.js says it completed) while on tinderbox would still show as yellow/running.
After digging through this, there are several issues: 1) there was a problem with dm-webtools02 that caused mail delays for a short period of time. This is fixed. 2) there was a problem with one of the nameservers in scl1 and DNS timeouts for mail.build.mozilla.org. This is fixed. 3) There are occasional SMTP timeouts talking to dm-mail03 as described in the forked bug 654202. If the timeouts are long and consistent enough, they will be caught by the SMTP check on dm-mail03. TinderboxMailNotifier should handle temporary failures correctly, or a real MTA that handles transient errors should be used. The check we just added to check the time/size of files in dm-webtools02:/var/www/iscsi/webtools/tinderbox/data is only checking how well dm-webtools02 is processing these mail messages AFTER they've already been delivered, and so is not a good check of any mail flow issues, but is, perhaps, still useful to see if the /var/www/iscsi/webtools/tinderbox/processbuild.pl script (which runs every minute) is falling far behind. I believe that this and associated bugs about beefing up the infrastructure (unless infra finds that dm-mail03 is having resource issues) should be closed and releng should focus on fixing 654202 so that the message delivery conforms to RFC specs.
(In reply to comment #23) > After digging through this, there are several issues: > > 1) there was a problem with dm-webtools02 that caused mail delays for a > short period of time. This is fixed. > > 2) there was a problem with one of the nameservers in scl1 and DNS timeouts > for mail.build.mozilla.org. This is fixed. > > 3) There are occasional SMTP timeouts talking to dm-mail03 as described in > the forked bug 654202. If the timeouts are long and consistent enough, they > will be caught by the SMTP check on dm-mail03. TinderboxMailNotifier should > handle temporary failures correctly, or a real MTA that handles transient > errors should be used. > > The check we just added to check the time/size of files in > dm-webtools02:/var/www/iscsi/webtools/tinderbox/data is only checking how > well dm-webtools02 is processing these mail messages AFTER they've already > been delivered, and so is not a good check of any mail flow issues, but is, > perhaps, still useful to see if the > /var/www/iscsi/webtools/tinderbox/processbuild.pl script (which runs every > minute) is falling far behind. If the issues are fixed and we now have eyes for when it happens again then I think it makes sense to close this bug. Thanks arr et all. > > I believe that this and associated bugs about beefing up the infrastructure > (unless infra finds that dm-mail03 is having resource issues) should be > closed and releng should focus on fixing 654202 so that the message delivery > conforms to RFC specs. Sounds good to me. If anyone objects please REOPEN.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
The check went off: nagios-sjc1: [75] dm-webtools02:Tinderbox Mail is WARNING: WARNING: Oldest message is 12 minutes old I will figure out with IT what we should be doing now that it came up.
So the check proves that it is the queue runner that is slow - that is, mail is being delivered to tinderbox in a timely fashion, and it just can't keep up in some cases. I opened bug 659724 to work on that.
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.