Closed Bug 657521
Opened 14 years ago
Closed 14 years ago
better disk i/o for more mail-chewing horsepower for tinderbox
Categories: (mozilla.org Graveyard :: Server Operations, task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: dustin, Assigned: zandr)
Attachments: (1 file) 57.10 KB, image/png
In bug 653969, it seems that tinderbox is having a hard time dealing with the volume of mail it's getting. Aside from longer-term fixes to remove tinderbox from the loop, can we beef this machine up in any way to make it able to process these mails more quickly?
I don't know anything about the infra; I'm just the messenger trying to split bug 653969 out into its constituent parts.
Comment 1•14 years ago
I haven't seen any evidence suggesting the delay was actually in the mail processing on tinderbox. We probably won't be able to determine this for sure until the nagios check is in place (bug 657519). The only time people managed to ping the right folks while it was happening turned out to be mail getting queued on an intermediary mail server due to a server outage on one of the mail servers, and not with tinderbox itself. If that's been the case every time this has happened, then more horsepower for tinderbox won't help anything (although more for said intermediate servers might...)
Comment 2•14 years ago
Which are the intermediate servers?
Comment 3•14 years ago
Armen: Do we have copies of some sample (over time) of the tardily delivered emails so we can look at the Received headers to see where the bottleneck is?
Comment 4•14 years ago
(In reply to comment #3)
> Armen: Do we have copies of some sample (over time) of the tardily delivered
> emails so we can look at the Received headers to see where the bottleneck is?
nthomas pointed to this log:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1304198150.1304207827.17628.gz
in bug 653969#c1.
Does that give you enough?
I don't have anything over time at hand.
Comment 5•14 years ago
After looking at the mail handling cronjob and the number of files being processed, I am not seeing any issues with tinderbox's handling of the incoming mail load.
Adding hardware to this service would just be tossing hardware at an issue that isn't currently causing any grief.
To be sure this doesn't become a problem, we should add a nagios check on the size of the inbound mail queue, as that will be a good check for any tbox issues.
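One way such a check could look, as a minimal sketch (the function name, directory argument, and threshold are illustrative, not the actual check from bug 657519; it tests message age, which is effectively what the later alerts in this bug report):

```shell
#!/bin/sh
# check_queue_age DIR WARN_MIN: nagios-style check that warns when any
# file in the queue directory DIR is older than WARN_MIN minutes.
# Both arguments are placeholders for this sketch.
check_queue_age() {
    dir="$1"; warn_min="$2"
    # find prints any file whose mtime is more than warn_min minutes ago
    stale=$(find "$dir" -type f -mmin +"$warn_min" | head -n 1)
    if [ -n "$stale" ]; then
        echo "WARNING: messages older than ${warn_min} minutes in $dir"
        return 1
    fi
    echo "OK: queue current"
}
```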
Comment 6•14 years ago
As per bug 653969, we will leave it to IT to determine if more horsepower is needed.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX
Comment 7•14 years ago
The Nagios check caught this when mail backed up on tinderbox again today. The machine itself is not heavily loaded; the problem seems to be the serial nature of some processing scripts. Bug 659724 filed to track making the script(s) more efficient.
See Also: → 659724
Comment 8•14 years ago
(In reply to comment #7)
> Nagios check caught this when mail backed on tinderbox again today.
To be clear, it was not the mail being delayed on delivery but the processing of the logs after delivery.
Comment 9•14 years ago
Ran "time ls TraceMonkey" 5 times and got the following times:
real 0m3.834s
real 0m4.702s
real 3m3.399s
real 0m2.851s
real 0m2.980s
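A small loop can collect those timings repeatedly; a sketch (the directory argument is whatever tree is being tested, and 1-second resolution is coarse but enough to spot the multi-minute outliers seen above):

```shell
#!/bin/sh
# time_ls DIR N: list DIR N times, printing the wall-clock seconds
# each listing took.
time_ls() {
    dir="$1"; n="$2"; i=1
    while [ "$i" -le "$n" ]; do
        start=$(date +%s)
        ls "$dir" > /dev/null
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        i=$((i + 1))
    done
}

time_ls /tmp 3
```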
Comment 10•14 years ago
Reopening, because RelEng had to close the tree at 16:40 today due to email processing backlogging here again. While today was busy, the load was not particularly heavy compared to yesterday.
The Nagios alert let us jump on the problem quickly. While the backlog was still happening, bear got (temp) login access to look around and discovered:
* There were no extra-large emails to be processed (largest was 13m).
* "top" reports that the CPUs are all >50% idle.
* Disk access was really slow. Bear ran "ls" 4 times in "/var/www/iscsi/webtools/tinderbox/TraceMonkey"; the first was killed after 5 minutes, then one took 3 seconds, one took 14 seconds, and one took 3 minutes.
* In "/var/www/iscsi/webtools/tinderbox/TraceMonkey", bear ran "while true; do ls -la | wc -l ; sleep 30 ; done". Details at: https://spreadsheets.google.com/spreadsheet/ccc?key=0AlIN8kWEeaF0dFhYbV92LXpwU1c5SUVqTUZndmxVV1E&hl=en_US&authkey=CMKR29IJ#gid=0
* Much later we checked ganglia and noticed that one CPU was pegged while the rest were idle.
Without us doing anything, we saw the disk i/o improve, so "ls" now completes in <4 seconds. The mail process is starting to catch up again, but we're watching and keeping the tree closed until the backlog clears.
Please investigate. Maybe the disk is sick and needs replacement, or a shared disk is being overloaded by something else?
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Summary: more mail-chewing horsepower for tinderbox → better disk i/o for more mail-chewing horsepower for tinderbox
Comment 11•14 years ago
(In reply to comment #10)
> Please investigate. Maybe the disk is sick needing replacement or this is a
> shared disk being overloaded by something else?
I'm new to the issue at hand, but here are some quick observations:
You are talking about needing high I/O on an iSCSI volume that is then NFS-exported to another node (dm-webtools05). This is not the best recipe for consistently high-speed disk IO.
Can you detail exactly when the errors and issues occurred? I don't see anything alarming in the system during the 16:00 hour. There was a spike in IO waits before 16:00, but even that was nothing compared to other IO blocks earlier in the night.
Comment 12•14 years ago
It looked from df like /var/www/iscsi was actually a local disk at /dev/sda1.
Comment 13•14 years ago
(In reply to comment #10)
> Reopening, because RelEng had to close the tree at 16:40 today due to email
> processing backlogging here again. While today was a busy day, with busy
> load, it was not particularly heavy compared to yesterday.
>
> Nagios alert let us jump on the problem quickly. While backlog was still
> happening, bear got (temp) login access to look around, and discovered:
> * there were no extra large emails to be processed (largest was 13m).
> * "top" reports that the CPUs are all >50% idle.
> * disk access was really slow. Bear ran "ls" 4 times in
> "/var/www/iscsi/webtools/tinderbox/TraceMonkey""; killed first one after
> 5mins; then one took 3 seconds, one took 14seconds and one took 3 minutes.
> * in "/var/www/iscsi/webtools/tinderbox/TraceMonkey", bear ran "while true;
> do ls -la | wc -l ; sleep 30 ; done". Details at:
this should be /var/www/iscsi/webtools/tinderbox/data
Comment 14•14 years ago
(In reply to comment #12)
> It looked from df like /var/www/iscsi was actually a local disk at /dev/sda1.
that's the "scsi" in "iscsi".. ;)
The local physical disks are all in an HP array managed by the cciss driver, so they show up under /dev/cciss, which is why the iSCSI disk shows up as the first SCSI device.
[1:0:0:0] disk EQLOGIC 100E-00 4.1 /dev/sda
(iscsi mount from the equallogic)
Comment 15•14 years ago
Ok, so I'll ask the same question I always ask when a network storage device shows unexplained slowdowns - what else shares that storage, and is it behaving itself ?
Comment 16•14 years ago
(In reply to comment #15)
> Ok, so I'll ask the same question I always ask when a network storage device
> shows unexplained slowdowns - what else shares that storage, and is it
> behaving itself ?
iscsi devices are exclusive so nothing else was sharing the mount point (which is probably why you have it nfs mounted from dm-webtools02 to the other host??)
As for the storage system itself, we'll look into that. But my point is that this wasn't designed for consistent flow from end to end (though you can count on it the majority of the time). Things like network congestion can cause waits. This is one of the downsides of network-based iSCSI compared to, say, fibre channel.
Comment 17•14 years ago
No messages or indication of problems were found on the equallogic. Nothing out of the ordinary on the graphs either.
Comment 18•14 years ago
Happening again
nagios-sjc1: [77] dm-webtools02:Tinderbox Mail is WARNING: WARNING: Oldest message is 11 minutes old
Severity: normal → major
Comment 19•14 years ago
Assigning to zandr.
I'm looking at this now; the scrape.pl script is running up the load (along with a bit of nfsd traffic).
Assignee: server-ops → zandr
Comment 20•14 years ago
Watching this in real time: the box is processing these serially, with serial scripts. There is plenty more processing power and IO to go around, but this needs to be parallelized (both the process and the scripts) in order to fix the overall problem.
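A sketch of what that fan-out could look like with stock tools (the queue path and the echo stand-in are placeholders; the real per-message work is done by the tinderbox Perl scripts):

```shell
#!/bin/sh
# parallel_process QUEUE: hand each queued file to a worker, running up
# to 4 workers at once via xargs -P, instead of one serial loop.
parallel_process() {
    queue="$1"
    # The sh -c 'echo ...' is a placeholder for the real per-message script.
    ls "$queue" | xargs -P 4 -n 1 -I {} sh -c 'echo "processed {}"'
}
```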
Comment 21•14 years ago
1) From discussions in irc, there are ideas about changing how the scripts do file locking to improve efficiency. However, as these scripts have not been touched in literally years, are not well-understood code, and have no clear owner, we're putting this dead last on the options list.
2) cshields: I saw your comments about no alerts being thrown, but I didn't understand comment#15, comment#16. Is this tinderbox server running on shared remote disks? If so, has any other system increased load on those shared disks?
3) Got more nagios alerts while commuting, at 8:26 and 8:46. See below. We're watching to see if this will lead to another tree closure.
-------- Original Message --------
Subject: ** PROBLEM alert - dm-webtools02/Tinderbox Queue is WARNING **
Date: Thu, 9 Jun 2011 08:26:15 -0700 (PDT)
From: nagios@dm-nagios01.mozilla.org (nagios)
To: release@mozilla.com
***** Nagios *****
Notification Type: PROBLEM
Service: Tinderbox Queue
Host: dm-webtools02
Address: 10.2.74.14
State: WARNING
Date/Time: 06-09-2011 08:26:15
Additional Info:
WARNING: Oldest message is 13 minutes old
Comment 22•14 years ago
I'm lost. Which are the servers involved here?
(In reply to comment #21)
> 1) From discussions in irc, there are ideas about changing how the scripts
> do file locking to improve efficiency. However, as these have not been
> touched in literally years, are not-well-understood code, and have no clear
> owner, we're putting this dead last on the options list.
Isn't this going to be the highest gain? Can't we pull in someone from webdev to help? We do that with other external web projects with amazing results.
> 2) cshields: I saw your comments about no alerts being thrown, but I didnt
> understand comment#15, comment#16. Is this tinderbox server running on
> shared remote disks? If so, has any other system increased load on those
> shared disks?
Not substantially, and not in a way that would impact only a single host. A shared resource under performance constraints affects -all- hosts, not just one. And we have no indication anywhere that that's the case.
Comment 23•14 years ago
(In reply to comment #21)
> 1) From discussions in irc, there are ideas about changing how the scripts
> do file locking to improve efficiency. However, as these have not been
> touched in literally years, are not-well-understood code, and have no clear
> owner, we're putting this dead last on the options list.
Yes, that is the root of the problem here. The original design was for a build network years ago, and this is one of the pain points of outgrowing that design. We won't be able to gain much through hardware.
I understand that things are breaking and blocking now, and that my suggestion is not a quick fix.
> 2) cshields: I saw your comments about no alerts being thrown, but I didnt
> understand comment#15, comment#16. Is this tinderbox server running on
> shared remote disks? If so, has any other system increased load on those
> shared disks?
The volume is not shared; it is exclusive to dm-webtools02. However, dm-webtools02 then shares that volume with dm-webtools05 via NFS. Depending on what dm-webtools05 is doing, this could add a lot of load and complication. But the volume itself on the SAN is not shared. The SAN itself is shared, but as I mentioned in comment 17, there are no problems, issues, or errors in it right now.
In short, nothing is operating outside of "normal". However, "normal" is an increasing number of jobs to process, so you are hitting the limit of processing them with one thread/core as quickly as they come in (hence the discussion of parallel processing). I'll attach a graph of the load over the past year; you can see the growing trend. This isn't something we can solve with better disk i/o. There is i/o to spare; we just need to make use of the other cores to process.
Comment 24•14 years ago
Comment 25•14 years ago
Marking this bug as closed. IT has shown that none of the systems associated with this server are having persistent outages or events that would impact the tinderbox inbound processing queue.
Having chewed on last night's debugging and the IT work of today, the fix is either getting the security review for tbpl (bug 661365) done or getting bug 659724 worked on.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard