Closed Bug 676822 Opened 13 years ago Closed 13 years ago

Investigate dm-webtools02/04 for causes of slowness

Categories

(Infrastructure & Operations :: RelOps: General, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: cshields)

References

Details

Attachments

(2 files)

Per separate email to RelOps, the number of jobs tinderbox is being asked to process is broadly the same as July, but tinderbox hasn't been keeping up for the last three days, eg:
 dm-webtools02:Tinderbox Queue is WARNING: WARNING: 4610 Messages Queued. Oldest is 123 minutes old

Please check for any issues which might be causing mail processing to be slower than usual, notwithstanding the known bottleneck discussed in bug 659724. For example, high latency on network storage, spidering on the web server causing the same, and so on.

This is a blocker because trees are closed and developers cannot land changes.
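
For illustration, a minimal latency probe of the sort that could confirm or rule out slow network storage; the mount path and sample count here are assumptions, not part of the actual tinderbox setup:

  # Minimal storage-latency probe (illustrative sketch only).
  # The path below is an assumption -- substitute the real tinderbox data mount.
  import os, time, statistics

  def probe_write_latency(path, samples=50, size=4096):
      """Time small fsync'd writes to spot storage latency spikes."""
      payload = b"x" * size
      test_file = os.path.join(path, ".latency_probe")
      latencies = []
      for _ in range(samples):
          start = time.time()
          with open(test_file, "wb") as f:
              f.write(payload)
              f.flush()
              os.fsync(f.fileno())   # push the write all the way to storage
          latencies.append((time.time() - start) * 1000.0)  # milliseconds
      os.unlink(test_file)
      return statistics.mean(latencies), max(latencies)

  if __name__ == "__main__":
      mean_ms, worst_ms = probe_write_latency("/var/tinderbox/data")
      print("mean %.1f ms, worst %.1f ms" % (mean_ms, worst_ms))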
The ganglia graphs show a bit more iowait than in previous weeks, but I'd like to see how that lines up with the number of jobs over the past three weeks, so I can compare against the month-level ganglia graphs and see whether they trend together.

https://ganglia.mozilla.org/sjc1/?r=month&c=Webtools&h=dm-webtools02.mozilla.org&mc=2

Would it be possible to get the stats from the past three weeks, and are there live stats going forward?
This is not a blocker. The root cause has been identified, and bug 659724 is tending towards WONTFIX by RelEng (the owner of this code).

If the root cause is WONTFIXed, there's nothing we can do about it.
Severity: blocker → major
Depends on: 659724
See also bug 676822 (same boxes). Bug 676822 comment 6 may be relevant.
Gavin, which bug did you mean? (You pointed to this bug.)
I guess the issue we want to debug is why performance was different even though the load was similar.

IIUC, what we want to know is what to avoid (based on what was happening last week) until we are completely off tinderbox.

arr, from looking at the half-year view, I see that wio has been happening for a very long time. [1]

Is there anything shared between the two hosts?

From week 27 to week 31 there seems to be a bump on dm-webtools04 [2]; mxr is running on it. Average idle time on that machine has also been decreasing [3].

[1] https://ganglia.mozilla.org/sjc1/graph.php?c=Webtools&h=dm-webtools02.mozilla.org&v=8.4&m=cpu_wio&r=halfyear&z=medium&jr=&js=&st=1312836097&vl=%25&ti=CPU%20wio&z=large
[2] https://ganglia.mozilla.org/sjc1/graph.php?c=Webtools&h=dm-webtools04.mozilla.org&v=8.4&m=cpu_wio&r=halfyear&z=medium&jr=&js=&st=1312836097&vl=%25&ti=CPU%20wio&z=large
[3] https://ganglia.mozilla.org/sjc1/graph.php?c=Webtools&h=dm-webtools04.mozilla.org&v=54.6&m=cpu_aidle&r=halfyear&z=medium&jr=&js=&st=1312837071&vl=%25&ti=CPU%20aidle&z=large
There is nothing to debug here. We've been hanging right at the capacity limit for a long time, and all the deeper queues mean is that you're spending more time over the limit than under.

The only thing that can be done here is to reduce load. Fewer pushes won't happen; are there branches that could be dropped?
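
To illustrate why sitting at the limit behaves this way (made-up numbers, not measurements): once arrivals run even slightly above the single-threaded service rate, the queue grows without bound; slightly below, it stays empty.

  # Toy queue model of running at the capacity limit (made-up rates, not real data).
  def simulate(arrival_per_min, service_per_min, minutes):
      backlog = 0.0
      for _ in range(minutes):
          backlog += arrival_per_min
          backlog = max(0.0, backlog - service_per_min)
      return backlog

  # Capacity of 30 msgs/min; arrivals only 5% higher -> ~720 queued after 8 hours.
  print(simulate(31.5, 30.0, 8 * 60))
  # Arrivals 5% lower -> the queue never builds up at all.
  print(simulate(28.5, 30.0, 8 * 60))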
The two main branches producing load are Try and Mozilla-Inbound, accounting for 55% and 20% of all load, respectively.

As of now, I see a few things that could move us forward:
1) bug 659724 - grabbing some low-hanging fruit
2) stop sending tinderbox mail for Try
3) push harder on getting non-tbox-based tbpl to be production ready
4) determine the root of the spike on dm-webtools04 and reduce the load (assuming that it is affecting the nfs performance and affecting dm-webtools02 indirectly)
(In reply to Amy Rich [:arich] from comment #1)
> Would it be possible to get the stats from the past three weeks, and are
> there live stats going forward?

Here's the data which caused me to file this bug. The graphs show the number of Firefox & Fennec jobs, since July 01, compared to the tinderbox backlog (from nagios). Not all of the jobs submit mail to tinderbox, but the proportion that don't is small and bounded in number.

I'm getting the job data from querying the buildbot_schedulers db using cruncher:~nthomas/tinder-lag, and stuffing it into gnuplot. Open to suggestions on how to automate that into our other monitoring systems.
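
For illustration, a rough sketch of that kind of query-and-gnuplot feed; the table and column names (buildrequests, complete_at) and the connection details are assumptions about the buildbot schema, not the actual tinder-lag script:

  # Rough sketch: hourly job counts from the schedulers db, written out for gnuplot.
  # Schema and connection details are assumptions, not the real tinder-lag script.
  import collections
  import datetime
  import pymysql

  conn = pymysql.connect(host="cruncher-db", user="reader", password="secret",
                         database="buildbot_schedulers")
  with conn.cursor() as cur:
      cur.execute("SELECT complete_at FROM buildrequests "
                  "WHERE complete_at IS NOT NULL "
                  "AND complete_at >= UNIX_TIMESTAMP('2011-07-01')")
      per_hour = collections.Counter(
          datetime.datetime.utcfromtimestamp(row[0]).strftime("%Y-%m-%dT%H:00")
          for row in cur)

  # One "timestamp count" line per hour; gnuplot reads it with e.g.
  #   set xdata time; set timefmt "%Y-%m-%dT%H:%M"; plot "jobs.dat" using 1:2 with lines
  with open("jobs.dat", "w") as out:
      for hour in sorted(per_hour):
          out.write("%s %d\n" % (hour, per_hour[hour]))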
Forgot to say that I have only partial irc logs in the period 23 July - Aug 1, so the pending graphs are unreliable in that range.
(In reply to Gavin Sharp from comment #3)
> See also bug 676822 (same boxes). Bug 676822 comment 6 may be relevant.

Sorry, I meant bug 676432 (bug 676432 comment 6).
(In reply to Gavin Sharp from comment #10)
> (In reply to Gavin Sharp from comment #3)
> > See also bug 676822 (same boxes). Bug 676822 comment 6 may be relevant.
> 
> Sorry, I meant bug 676432 (bug 676432 comment 6).

Wrong host. There is a tinderbox stage instance on dm-webtools04, but that's not this problem. This is dm-webtools02. We have exceeded the limits of this single-threaded design. There isn't much more to say here.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #7)

> 4) determine the root of the spike on dm-webtools04 and reduce the load
> (assuming that it is affecting the nfs performance and affecting
> dm-webtools02 indirectly)

This is not a valid assumption. For reference, the issue on dm-webtools04 *is* 
https://bugzilla.mozilla.org/show_bug.cgi?id=676432#c6
To be clear, dm-webtools04 and dm-webtools02 share *no* host infrastructure (no local or shared disk, no CPU/RAM, etc.) and do not affect each other in any way with regard to system performance.
nthomas thinks there might be a correlation between the linux64 repacks last week and the iscsi load on tinderbox (dm-webtools02) during the same time period.  webtools02-01 is the volume used for tinderbox and resides primarily on eql01-ps01.  nthomas/catlee indicated that the linux64 repacks would have impacted eql01-bm02, which is also on eql-ps01.

I've included the nagios logs for tinderbox for August (timestamps are since the epoch) and hopefully nthomas can supply times/dates for the repacks.
Group: mozilla-corporation-confidential (both attachments)
We had one repack failure for the 6.0b5 release, out of 7 that ran. It started on moz2-linux64-slave05.build.m.o at Thu Aug 4 08:37:14 2011 Pacific, and finished 20 minutes later. The failure we saw indicated slower than usual I/O performance (a timeout after twenty minutes creating a working copy of the mozilla-beta repo).

Tinderbox had a backlog of about 1000 messages, with the oldest 23 minutes old, during that period.
FYI, we're also talking to EqualLogic to see if they can help find/diagnose any possible issues with the storage array.
We performed an upgrade to dm-webtools02 last night/early this morning (coordinated with buildduty) which went well, but I don't think it's made a noticeable improvement to performance.
Update on this:

We are working with Dell on a -possible- issue with the SAN that MXR and Tinderbox are using.  I say possible because right now there are no indicators to show what is to blame; all we know is that we can see the degradation and that it coincides with the bugs we have gotten.

More on this when we know more, but we are working on this as I speak.
Assignee: server-ops-releng → cshields
To clarify my earlier comment, we knew there was an IO issue - I emphasized a possible issue with the SAN because I was not sure if it existed in the SAN, network, or server layer..

We believe now it was the SAN.  We are shuffling around some volumes on the SAN per suggestion from Dell.  These operations will take time, so we will update again tomorrow on the outcome.

In the meantime, we have seen a vast improvement across the SAN, and though the tinderbox backlog has a long way to go - it is starting to decline.
(In reply to Corey Shields [:cshields] from comment #21)
> To clarify my earlier comment, we knew there was an IO issue - I emphasized
> a possible issue with the SAN because I was not sure if it existed in the
> SAN, network, or server layer..
> 
> We believe now it was the SAN.  We are shuffling around some volumes on the
> SAN per suggestion from Dell.  These operations will take time, so we will
> update again tomorrow on the outcome.
> 
> In the meantime, we have seen a vast improvement across the SAN, and though
> the tinderbox backlog has a long way to go - it is starting to decline.

Queue processing has now caught up with the backlog as of 2pm. Unclear if that's because of low load on the weekend, or because the SAN-shuffling suggestion worked, or a combination, but it's great to see.
The former, I expect. The SAN shuffling isn't complete.
SAN shuffling is still happening in spurts..  We are about half way there.  No idea when the rest will be done, we are at the mercy of an "automagical" system.  :|
(In reply to Corey Shields [:cshields] from comment #24)
> SAN shuffling is still happening in spurts..  We are about half way there. 
> No idea when the rest will be done, we are at the mercy of an "automagical"
> system.  :|


It's still early (9:15 PDT), and we're already seeing a 32-minute delay in processing tinderbox messages. Any way to make this "automagical" system rebalance faster, before the rest of the Mozilla developers start doing work and put us even further behind?



***** Nagios  *****
Date/Time: 08-15-2011 09:04:44
WARNING: 1005 Messages Queued. Oldest is 32 minutes old
Severity: major → blocker
> (In reply to Corey Shields [:cshields] from comment #24)
>> SAN shuffling is still happening in spurts..  We are about half way there. 
>> No idea when the rest will be done, we are at the mercy of an "automagical"
>> system.  :|
> 
> It's still early (9:15 PDT), and we're already seeing a 32-minute delay in
> processing tinderbox messages. Any way to make this "automagical" system
> rebalance faster, before the rest of the Mozilla developers start doing work
> and put us even further behind?

What is the status of this rebalancing? 

The processing delay is significant.

Date/Time: 08-15-2011 21:26:49
Additional Info:
WARNING: 9054 Messages Queued. Oldest is 283 minutes old
SAN shuffling is still happening in spurts..  We are about half way there.  No idea when the rest will be done, we are at the mercy of an "automagical" system.  :|
The progress that was made over the weekend in moving tinderbox data off of the exhausted tray and onto other trays seems to have regressed a bit, and 100GB of data was moved back.  This is probably why Monday was rough on tinderbox.

We have no idea why this happened or what its reasoning is in the shuffling; all we can do is set a preference for the data and let it do its job.

One thing that might help is freeing a chunk of disk so there is more buffer for these volumes to move, because right now there is very little.  We are working on identifying an old breakpad volume that should not be in use anymore.  Hope to have that cleared up today.
Date/Time: 08-16-2011 10:37:44
Additional Info:
WARNING: 14290 Messages Queued. Oldest is 335 minutes old

This is 5-6 hours of wait time for results after jobs have completed.

Trees closed by bsmedberg, because waiting 5-6 hours for results can cause a pileup of bad checkins before people know there is a problem.
As of now, all of the tinderbox data is off of the degraded SATA tray and on the 3 SAS trays.

In the process of working through Dell's recommendations we ended up with an extra iSCSI initiator connected to the SAN.  When I removed this initiator it logged all of them out, and reconnecting them left stale file handles - a reboot was done at 17:10 to clear these out (the tree was already closed at this time).

As far as the iSCSI connection goes, we should be back to the operational point we were at before 7/27 in terms of speed and capability.

We are now working through the backlog of messages, and the box is CPU-bound on this (going back to the single-threaded workflow).

I will leave this bug open until we've had a good working day (without tree closure) to see tinderbox run..  But I feel like we have made good progress here.

A quick apology to those who have had to deal with this issue - we realize that it throws a kink into the workflow and development.  Please understand we have had multiple people working late hours (and most of last weekend) to get this resolved.  It has been our top priority.
Thanks for the update cshields.


As of:
Date/Time: 08-16-2011 17:27:37
Additional Info:
WARNING: 11746 Messages Queued. Oldest is 474 minutes old

Over the last 30 mins, the tinderbox server has been processing ~300 messages every 10 mins. This puts us at clearing the backlog in approx 7 hours (~midnight tonight PDT). After some discussion, we feel it's important not to delete these results already in flight; they are worth processing before reopening the tree.

This rate may improve now that migration has completed; we will continue to monitor.
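
For reference, the estimate above is just the backlog divided by the observed drain rate, ignoring new mail still arriving; a quick sketch:

  # Back-of-the-envelope backlog ETA (ignores new messages still arriving).
  backlog = 11746            # messages queued at 17:27 PDT
  drain_rate = 300 / 10.0    # observed ~300 messages per 10 minutes
  minutes_to_clear = backlog / drain_rate
  print("%.0f minutes (~%.1f hours)" % (minutes_to_clear, minutes_to_clear / 60))
  # -> 392 minutes (~6.5 hours), i.e. roughly midnight PDT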
aki/bear figured out a way to find and remove try-build emails still waiting to be processed. These were safe to remove, as developers already receive those emails directly.

This removed 1313 pending messages, and brings us closer to reopening the trees.
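
For illustration only, a sketch of how such a cleanup could look; the queue directory and the Subject-based matching below are assumptions, not the actual procedure aki/bear used:

  # Illustrative sketch only -- not the actual cleanup aki/bear performed.
  # Assumes the tinderbox queue is a directory of plain mail files and that
  # try results can be recognized from the Subject header.
  import email
  import os

  QUEUE_DIR = "/var/spool/tinderbox"   # assumed location

  removed = 0
  for name in os.listdir(QUEUE_DIR):
      path = os.path.join(QUEUE_DIR, name)
      with open(path, "rb") as f:
          msg = email.message_from_binary_file(f)
      if "Try" in msg.get("Subject", ""):   # assumed marker for try-build mail
          os.unlink(path)
          removed += 1
  print("removed %d try messages" % removed)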
Date/Time: 08-16-2011 19:27:55
WARNING: 9668 Messages Queued. Oldest is 492 minutes old
I'm confused. In the email describing the evasive action we were taking, it was indicated that Try amounted to half the load.

But in comment 32, removing the pending try emails only pulled 1313 messages out of (at the time) over 10,000.

Something doesn't add up. What am I missing?
The messages removed represent an unknown point in time inside the backlog; who knows what percentage of Try messages had already been processed.

We cannot take this single snapshot of data and extrapolate to other previous backlogs.
(In reply to Zandr Milewski [:zandr] from comment #34)
> I'm confused. In the email describing the evasive action we were taking, it
> was indicated that Try amounted to half the load.
> 
> But in comment 32, removing the pending try emails only pulled 1313 messages
> out of (at the time) over 10,000.
> 
> Something doesn't add up. What am I missing?

Possibly bug 679448, which turned off new try tinderbox mail?
As of:
Date/Time: 08-17-2011 00:08:58
Additional Info:
WARNING: 4945 Messages Queued. Oldest is 304 minutes old

Over the last 60 mins, the tinderbox server has been processing ~170 messages every 10 mins. Unknown why this rate has dropped from the earlier one.

Despite bulk deletions from the pending queue by aki/bear, this still leaves us with a significant backlog at ~midnight PDT.

We'll continue to monitor, and keep the tree closed, until this backlog clears.
[04:20]  <nagios-sjc1> dm-webtools02:Tinderbox Queue is OK: OK: 225 Messages Queued. Oldest is 5 minutes old

I opened the trees
The reason that processing times varied was because tinderbox was constantly getting more messages even though most of the trees were closed.  For example, I took a quick look at the mail logs and we processed 4200 new messages from midnight to 4:30PDT when the trees were reopened (in addition to working on the backlog).
(In reply to Amy Rich [:arich] from comment #39)
> The reason that processing times varied was because tinderbox was constantly
> getting more messages even though most of the trees were closed.  For
> example, I took a quick look at the mail logs and we processed 4200 new
> messages from midnight to 4:30PDT when the trees were reopened (in addition
> to working on the backlog).

These messages (probably between ~1am and ~5am PDT) are for all the nightly builds across all branches. While we stopped all developers doing checkins for the day by closing the tree, I explicitly chose not to spread the disruption to the entire Mozilla community by cancelling nightlies. The load-vs-disruption balance didn't feel worth it.


Unknown what caused variable processing before midnight.
(In reply to Mike Taylor [:bear] from comment #38)
> [04:20]  <nagios-sjc1> dm-webtools02:Tinderbox Queue is OK: OK: 225 Messages
> Queued. Oldest is 5 minutes old
> 
> I opened the trees

bear: That's great. Thank you.

cshields: I'd like to leave this open for today to see how the disks handle a busy load. Between yesterday's 18-hour tree closure and developers now able to work on post-migration changes for the next train, I expect today to be a busy day.


Trees are currently open, so downgrading from blocker to normal.
Severity: blocker → normal
(In reply to John O'Duinn [:joduinn] from comment #41)
> cshields: I'd like to leave this open for today to see how the disks handle
> a busy load. Between yesterday's 18-hour tree closure and developers now able
> to work on post-migration changes for the next train, I expect today to be a
> busy day. 

Agreed..  Please poke zandr if things start to turn south.  I'm online now but didn't get much sleep on the plane - not sure how long I will last.
Note that for downtimes people normally don't close other trees besides Firefox and Try (and even then, I don't think Try was marked as closed, or the hooks did not work).

People kept on pushing frequently to other trees from noon until 5pm.

https://build.mozilla.org/buildapi/reports/pushes?starttime=1313521200&endtime=1313578800&int_size=3600
(In reply to John O'Duinn [:joduinn] from comment #40)
> (In reply to Amy Rich [:arich] from comment #39)
> > The reason that processing times varied was because tinderbox was constantly
> > getting more messages even though most of the trees were closed.  For
> > example, I took a quick look at the mail logs and we processed 4200 new
> > messages from midnight to 4:30PDT when the trees were reopened (in addition
> > to working on the backlog).
> 
> These messages (probably between ~1am and ~5am PDT) are for all the nightly
> builds across all branches. While we stopped all developers doing checkins
> for the day by closing the tree, I explicitly chose not to spread the
> disruption to the entire Mozilla community by cancelling nightlies. The
> load-vs-disruption balance didn't feel worth it.
> 
> 
> Unknown what caused variable processing before midnight.

We were getting messages the entire time; it wasn't just the nightly builds.

On Aug 15 we received a total of 46570 messages
On Aug 16 we received a total of 38689 messages

Here's the breakdown of new messages per hour on the 16th:

00:00 - 1605
01:00 - 1682
02:00 - 1441
03:00 - 1156
04:00 - 1991
05:00 - 2957
06:00 - 2483
07:00 - 2627
08:00 - 2777
09:00 - 2322
10:00 - 1888
11:00 - 1787
12:00 - 2210
13:00 - 1133
14:00 - 1134
15:00 - 1349
16:00 - 1198
17:00 - 1205
18:00 - 906
19:00 - 1070
20:00 - 1126
21:00 - 887
22:00 - 780
23:00 - 975
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #43)
> Note that for downtimes people normally don't close other trees besides
> Firefox and Try (and even then, I don't think Try was marked as closed, or
> the hooks did not work).

Usually the trees closed for a slow message queue are Firefox, Mozilla-Inbound, and Try. Try was closed.
(In reply to Marco Bonardo [:mak] from comment #45)
> (In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment
> #43)
> > Note that for downtimes people normally don't close other trees besides
> > Firefox and Try (and even then, I don't think Try was marked as closed,
> > or the hooks did not work).
> 
> Usually the trees closed for a slow message queue are Firefox,
> Mozilla-Inbound, and Try. Try was closed.

I know. I was just trying to point out that even if the Firefox and Try trees were closed there was activity going on in other trees and that explains (hopefully) why new messages were being received even though Firefox and Try were closed.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #46)
> I know. I was just trying to point out that even if the Firefox and Try
> trees were closed there was activity going on in other trees and that
> explains (hopefully) why new messages were being received even though
> Firefox and Try were closed.

New messages from in-process pushes that started before the tree closure could account for some portion of that, especially if the build was all that was in progress when the trees closed.
Severity: normal → blocker
Sorry, don't know how I managed to make this a blocker again!
Severity: blocker → normal
These two bugs aren't strictly blockers, but are reducing mail load to tinderbox, so should help.
Depends on: 679448, 679589
22:30:04 zandr: so, did something large and anomalous just happen?
...
22:38:47 philor: is that a nice way of saying "philor got home and retriggered fifty failed jobs"?

:D

We had a spike this evening. The first alert I saw in #build was at 21:53:34, with 340 messages queued, 12 minutes old. The peak reported by nagios was at 22:23, at 602/17min. The check turned green for the last time at 00:05 with 95/6min, and by 01:15 had no backlog.

So, we really don't have any headroom at all here.

The good news is that the netops work tomorrow should leave us with fewer things to retrigger.
We seem to be back at the same level of performance we were seeing about a month ago, where the bottleneck is once again the single-threaded tinderbox process.  I think we can close this bug out if it was tracking the disk issues.
I agree with Amy, we should close this bug as FIXED.
Couldn't agree more.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations