Closed Bug 676822 Opened 13 years ago Closed 13 years ago

Investigate dm-webtools02/04 for causes of slowness

Categories

(Infrastructure & Operations :: RelOps: General, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: cshields)

References

Details

Attachments

(2 files)

Per separate email to RelOps, the number of jobs tinderbox is being asked to process is broadly the same as July, but tinderbox hasn't been keeping up for the last three days, eg:
 dm-webtools02:Tinderbox Queue is WARNING: WARNING: 4610 Messages Queued. Oldest is 123 minutes old

Please check for any issues which might be causing mail processing to be slower than usual, notwithstanding the known bottleneck discussed in bug 659724. For example, high latency on network storage, spidering on the web server causing the same, and so on.

This is a blocker because trees are closed and developers cannot land changes.
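
For illustration, a minimal latency probe of the sort that could confirm or rule out slow network storage; the mount path and sample count here are assumptions, not part of the actual tinderbox setup:

  # Minimal storage-latency probe (illustrative sketch only).
  # The path below is an assumption -- substitute the real tinderbox data mount.
  import os, time, statistics

  def probe_write_latency(path, samples=50, size=4096):
      """Time small fsync'd writes to spot storage latency spikes."""
      payload = b"x" * size
      test_file = os.path.join(path, ".latency_probe")
      latencies = []
      for _ in range(samples):
          start = time.time()
          with open(test_file, "wb") as f:
              f.write(payload)
              f.flush()
              os.fsync(f.fileno())   # push the write all the way to storage
          latencies.append((time.time() - start) * 1000.0)  # milliseconds
      os.unlink(test_file)
      return statistics.mean(latencies), max(latencies)

  if __name__ == "__main__":
      mean_ms, worst_ms = probe_write_latency("/var/tinderbox/data")
      print("mean %.1f ms, worst %.1f ms" % (mean_ms, worst_ms))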
The ganglia graphs show a bit more iowait than in previous weeks, but I'd like to see how that lines up with the number of jobs over the past three weeks, so I can compare against the month-level ganglia graphs and see whether they trend together.

https://ganglia.mozilla.org/sjc1/?r=month&c=Webtools&h=dm-webtools02.mozilla.org&mc=2

Would it be possible to get the stats from the past three weeks, and are there live stats going forward?
This is not a blocker. The root cause has been identified, and bug 659724 is tending towards WONTFIX by RelEng (the owner of this code).

If the root cause is WONTFIXed, there's nothing we can do about it.
Severity: blocker → major
Depends on: 659724
See also bug 676822 (same boxes). Bug 676822 comment 6 may be relevant.
Gavin, which bug did you mean? (You pointed to this bug.)
I guess the issue we want to debug is why performance was different even though the load was similar.

IIUC, what we want to know is what to avoid (based on what was happening last week) until we are completely off tinderbox.

arr, from looking at the half-year view, I see that wio has been happening for a very long time. [1]

Is there anything shared between the two hosts?

From week 27 to week 31 there seems to be a bump on dm-webtools04 [2]; mxr is running on it. Average idle time on that machine has also been decreasing [3].

[1] https://ganglia.mozilla.org/sjc1/graph.php?c=Webtools&h=dm-webtools02.mozilla.org&v=8.4&m=cpu_wio&r=halfyear&z=medium&jr=&js=&st=1312836097&vl=%25&ti=CPU%20wio&z=large
[2] https://ganglia.mozilla.org/sjc1/graph.php?c=Webtools&h=dm-webtools04.mozilla.org&v=8.4&m=cpu_wio&r=halfyear&z=medium&jr=&js=&st=1312836097&vl=%25&ti=CPU%20wio&z=large
[3] https://ganglia.mozilla.org/sjc1/graph.php?c=Webtools&h=dm-webtools04.mozilla.org&v=54.6&m=cpu_aidle&r=halfyear&z=medium&jr=&js=&st=1312837071&vl=%25&ti=CPU%20aidle&z=large
There is nothing to debug here. We've been hanging right at the capacity limit for a long time, and all the deeper queues mean is that you're spending more time over the limit than under.

The only thing that can be done here is to reduce load. Fewer pushes won't happen; are there branches that could be dropped?
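
To illustrate why sitting at the limit behaves this way (made-up numbers, not measurements): once arrivals run even slightly above the single-threaded service rate, the queue grows without bound; slightly below, it stays empty.

  # Toy queue model of running at the capacity limit (made-up rates, not real data).
  def simulate(arrival_per_min, service_per_min, minutes):
      backlog = 0.0
      for _ in range(minutes):
          backlog += arrival_per_min
          backlog = max(0.0, backlog - service_per_min)
      return backlog

  # Capacity of 30 msgs/min; arrivals only 5% higher -> ~720 queued after 8 hours.
  print(simulate(31.5, 30.0, 8 * 60))
  # Arrivals 5% lower -> the queue never builds up at all.
  print(simulate(28.5, 30.0, 8 * 60))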
The two main branches producing load are Try and Mozilla-Inbound, accounting for 55% and 20% of all load, respectively.

As of now, I see a few things that could move us forward:
1) bug 659724 - grabbing some low-hanging fruit
2) stop sending tinderbox mail for Try
3) push harder on getting non-tbox-based tbpl to be production ready
4) determine the root of the spike on dm-webtools04 and reduce the load (assuming that it is affecting the nfs performance and affecting dm-webtools02 indirectly)
(In reply to Amy Rich [:arich] from comment #1)
> Would it be possible to get the stats from the past three weeks, and are
> there live stats going forward?

Here's the data which caused me to file this bug. The graphs show the number of Firefox & Fennec jobs, since July 01, compared to the tinderbox backlog (from nagios). Not all of the jobs submit mail to tinderbox, but the proportion that don't is small and bounded in number.

I'm getting the job data from querying the buildbot_schedulers db using cruncher:~nthomas/tinder-lag, and stuffing it into gnuplot. Open to suggestions on how to automate that into our other monitoring systems.
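
For illustration, a rough sketch of that kind of query-and-gnuplot feed; the table and column names (buildrequests, complete_at) and the connection details are assumptions about the buildbot schema, not the actual tinder-lag script:

  # Rough sketch: hourly job counts from the schedulers db, written out for gnuplot.
  # Schema and connection details are assumptions, not the real tinder-lag script.
  import collections
  import datetime
  import pymysql

  conn = pymysql.connect(host="cruncher-db", user="reader", password="secret",
                         database="buildbot_schedulers")
  with conn.cursor() as cur:
      cur.execute("SELECT complete_at FROM buildrequests "
                  "WHERE complete_at IS NOT NULL "
                  "AND complete_at >= UNIX_TIMESTAMP('2011-07-01')")
      per_hour = collections.Counter(
          datetime.datetime.utcfromtimestamp(row[0]).strftime("%Y-%m-%dT%H:00")
          for row in cur)

  # One "timestamp count" line per hour; gnuplot reads it with e.g.
  #   set xdata time; set timefmt "%Y-%m-%dT%H:%M"; plot "jobs.dat" using 1:2 with lines
  with open("jobs.dat", "w") as out:
      for hour in sorted(per_hour):
          out.write("%s %d\n" % (hour, per_hour[hour]))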
Forgot to say that I have only partial irc logs in the period 23 July - Aug 1, so the pending graphs are unreliable in that range.
(In reply to Gavin Sharp from comment #3)
> See also bug 676822 (same boxes). Bug 676822 comment 6 may be relevant.

Sorry, I meant bug 676432 (bug 676432 comment 6).
(In reply to Gavin Sharp from comment #10)
> (In reply to Gavin Sharp from comment #3)
> > See also bug 676822 (same boxes). Bug 676822 comment 6 may be relevant.
> 
> Sorry, I meant bug 676432 (bug 676432 comment 6).

Wrong host. There is a tinderbox stage instance on dm-webtools04, but that's not this problem. This is dm-webtools02. We have exceeded the limits of this single-threaded design. There isn't much more to say here.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #7)

> 4) determine the root of the spike on dm-webtools04 and reduce the load
> (assuming that it is affecting the nfs performance and affecting
> dm-webtools02 indirectly)

This is not a valid assumption. For reference, the issue on dm-webtools04 *is* 
https://bugzilla.mozilla.org/show_bug.cgi?id=676432#c6
To be clear, dm-webtools04 and dm-webtools02 share *no* host infrastructure (no local or shared disk, no CPU/RAM, etc.) and do not affect each other in any way with regard to system performance.
nthomas thinks there might be a correlation between the linux64 repacks last week and the iscsi load on tinderbox (dm-webtools02) during the same time period.  webtools02-01 is the volume used for tinderbox and resides primarily on eql01-ps01.  nthomas/catlee indicated that the linux64 repacks would have impacted eql01-bm02, which is also on eql-ps01.

I've included the nagios logs for tinderbox for August (timestamps are since the epoch) and hopefully nthomas can supply times/dates for the repacks.
Group: mozilla-corporation-confidential (both attachments)
We had one repack failure for the 6.0b5 release, out of 7 that ran. It started on moz2-linux64-slave05.build.m.o at Thu Aug 4 08:37:14 2011 Pacific, and finished 20 minutes later. The failure we saw indicated slower than usual I/O performance (a timeout after twenty minutes creating a working copy of the mozilla-beta repo).

Tinderbox had a backlog of about 1000 messages, with the oldest 23 minutes old, during that period.
FYI, we're also talking to EqualLogic to see if they can help find/diagnose any possible issues with the storage array.
We performed an upgrade to dm-webtools02 last night/early this morning (coordinated with buildduty) which went well, but I don't think it's made a noticeable improvement to performance.
Update on this:

We are working with Dell on a -possible- issue with the SAN that MXR and Tinderbox are using.  I say possible because right now there are no indicators to show what is to blame; all we know is that we can see the degradation and that it coincides with the bugs we have gotten.

More on this when we know more, but we are working on this as I speak.
Assignee: server-ops-releng → cshields
To clarify my earlier comment, we knew there was an IO issue - I emphasized a possible issue with the SAN because I was not sure if it existed in the SAN, network, or server layer..

We believe now it was the SAN.  We are shuffling around some volumes on the SAN per suggestion from Dell.  These operations will take time, so we will update again tomorrow on the outcome.

In the meantime, we have seen a vast improvement across the SAN, and though the tinderbox backlog has a long way to go - it is starting to decline.
(In reply to Corey Shields [:cshields] from comment #21)
> To clarify my earlier comment, we knew there was an IO issue - I emphasized
> a possible issue with the SAN because I was not sure if it existed in the
> SAN, network, or server layer..
> 
> We believe now it was the SAN.  We are shuffling around some volumes on the
> SAN per suggestion from Dell.  These operations will take time, so we will
> update again tomorrow on the outcome.
> 
> In the meantime, we have seen a vast improvement across the SAN, and though
> the tinderbox backlog has a long way to go - it is starting to decline.

Queue processing has now caught up with the backlog as of 2pm. Unclear if that's because of low load on the weekend, or because the SAN-shuffling suggestion worked, or a combination, but it's great to see.
The former, I expect. The SAN shuffling isn't complete.
SAN shuffling is still happening in spurts..  We are about half way there.  No idea when the rest will be done, we are at the mercy of an "automagical" system.  :|
(In reply to Corey Shields [:cshields] from comment #24)
> SAN shuffling is still happening in spurts..  We are about half way there. 
> No idea when the rest will be done, we are at the mercy of an "automagical"
> system.  :|


It's still early (9:15 PDT), and we're already seeing a 32-minute delay in processing tinderbox messages. Any way to make this "automagical" system rebalance faster, before the rest of the Mozilla developers start doing work and put us even further behind?



***** Nagios  *****
Date/Time: 08-15-2011 09:04:44
WARNING: 1005 Messages Queued. Oldest is 32 minutes old
Severity: major → blocker
> (In reply to Corey Shields [:cshields] from comment #24)
>> SAN shuffling is still happening in spurts..  We are about half way there. 
>> No idea when the rest will be done, we are at the mercy of an "automagical"
>> system.  :|
> 
> It's still early (9:15 PDT), and we're already seeing a 32-minute delay in
> processing tinderbox messages. Any way to make this "automagical" system
> rebalance faster, before the rest of the Mozilla developers start doing work
> and put us even further behind?

What is the status of this rebalancing? 

The processing delay is significant.

Date/Time: 08-15-2011 21:26:49
Additional Info:
WARNING: 9054 Messages Queued. Oldest is 283 minutes old
SAN shuffling is still happening in spurts..  We are about half way there.  No idea when the rest will be done, we are at the mercy of an "automagical" system.  :|
The progress that was made over the weekend in moving tinderbox data off of the exhausted tray and onto other trays seems to have regressed a bit, and 100GB of data was moved back.  This is probably why Monday was rough on tinderbox.

We have no idea why this happened or what its reasoning is in the shuffling; all we can do is set a preference for the data and let it do its job.

One thing that might help is freeing a chunk of disk so there is more buffer for these volumes to move, because right now there is very little.  We are working on identifying an old breakpad volume that should not be in use anymore.  Hope to have that cleared up today.
Date/Time: 08-16-2011 10:37:44
Additional Info:
WARNING: 14290 Messages Queued. Oldest is 335 minutes old

This is 5-6 hours of wait time for results after jobs have completed.

Trees closed by bsmedberg, because waiting 5-6 hours for results can cause a pileup of bad checkins before people know there is a problem.
As of now, all of the tinderbox data is off of the degraded SATA tray and on the 3 SAS trays.

In the process of working through Dell's recommendations we ended up with an extra iSCSI initiator connected to the SAN.  When I removed this initiator it logged all of them out, and reconnecting them left stale file handles - a reboot was done at 17:10 to clear these out (the tree was already closed at this time).

As far as the iSCSI connection goes, we should be back to the operational point we were at before 7/27 in terms of speed and capability.

We are now working through the backlog of messages, and the box is CPU-bound on this (going back to the single-threaded workflow).

I will leave this bug open until we've had a good working day (without tree closure) to see tinderbox run..  But I feel like we have made good progress here.

A quick apology to those who have had to deal with this issue - we realize that it throws a kink into the workflow and development.  Please understand we have had multiple people working late hours (and most of last weekend) to get this resolved.  It has been our top priority.
Thanks for the update cshields.


As of:
Date/Time: 08-16-2011 17:27:37
Additional Info:
WARNING: 11746 Messages Queued. Oldest is 474 minutes old

Over the last 30 mins, the tinderbox server has been processing ~300 messages every 10 mins. This puts us at clearing the backlog in approx 7 hours (~midnight tonight PDT). After some discussion, we feel it's important not to delete these results already in flight; they are worth processing before reopening the tree.

This rate may improve now that migration has completed; we will continue to monitor.
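
For reference, the estimate above is just the backlog divided by the observed drain rate, ignoring new mail still arriving; a quick sketch:

  # Back-of-the-envelope backlog ETA (ignores new messages still arriving).
  backlog = 11746            # messages queued at 17:27 PDT
  drain_rate = 300 / 10.0    # observed ~300 messages per 10 minutes
  minutes_to_clear = backlog / drain_rate
  print("%.0f minutes (~%.1f hours)" % (minutes_to_clear, minutes_to_clear / 60))
  # -> 392 minutes (~6.5 hours), i.e. roughly midnight PDT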
aki/bear figured out a way to find and remove try-build emails still waiting to be processed. These were safe to remove, as developers already receive those emails directly.

This removed 1313 pending messages, and brings us closer to reopening the trees.
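
For illustration only, a sketch of how such a cleanup could look; the queue directory and the Subject-based matching below are assumptions, not the actual procedure aki/bear used:

  # Illustrative sketch only -- not the actual cleanup aki/bear performed.
  # Assumes the tinderbox queue is a directory of plain mail files and that
  # try results can be recognized from the Subject header.
  import email
  import os

  QUEUE_DIR = "/var/spool/tinderbox"   # assumed location

  removed = 0
  for name in os.listdir(QUEUE_DIR):
      path = os.path.join(QUEUE_DIR, name)
      with open(path, "rb") as f:
          msg = email.message_from_binary_file(f)
      if "Try" in msg.get("Subject", ""):   # assumed marker for try-build mail
          os.unlink(path)
          removed += 1
  print("removed %d try messages" % removed)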
Date/Time: 08-16-2011 19:27:55
WARNING: 9668 Messages Queued. Oldest is 492 minutes old
I'm confused. In the email describing the evasive action we were taking, it was indicated that Try amounted to half the load.

But in comment 32, removing the pending try emails only pulled 1313 messages out of (at the time) over 10,000.

Something doesn't add up. What am I missing?
The messages removed represent an unknown point in time inside the backlog; who knows what percentage of Try messages had already been processed.

We cannot take this single snapshot of data and extrapolate to other previous backlogs.
(In reply to Zandr Milewski [:zandr] from comment #34)
> I'm confused. In the email describing the evasive action we were taking, it
> was indicated that Try amounted to half the load.
> 
> But in comment 32, removing the pending try emails only pulled 1313 messages
> out of (at the time) over 10,000.
> 
> Something doesn't add up. What am I missing?

Possibly bug 679448, which turned off new try tinderbox mail?
As of:
Date/Time: 08-17-2011 00:08:58
Additional Info:
WARNING: 4945 Messages Queued. Oldest is 304 minutes old

Over the last 60 mins, the tinderbox server has been processing ~170 messages every 10 mins. Unknown why this rate has dropped from the earlier one.

Despite bulk deletions from the pending queue by aki/bear, this still leaves us with a significant backlog at ~midnight PDT.

We'll continue to monitor, and keep the tree closed, until this backlog clears.
[04:20]  <nagios-sjc1> dm-webtools02:Tinderbox Queue is OK: OK: 225 Messages Queued. Oldest is 5 minutes old

I opened the trees
The reason that processing times varied was because tinderbox was constantly getting more messages even though most of the trees were closed.  For example, I took a quick look at the mail logs and we processed 4200 new messages from midnight to 4:30PDT when the trees were reopened (in addition to working on the backlog).
(In reply to Amy Rich [:arich] from comment #39)
> The reason that processing times varied was because tinderbox was constantly
> getting more messages even though most of the trees were closed.  For
> example, I took a quick look at the mail logs and we processed 4200 new
> messages from midnight to 4:30PDT when the trees were reopened (in addition
> to working on the backlog).

These messages (probably between ~1am and ~5am PDT) are for all the nightly builds across all branches. While we stopped all developers doing checkins for the day by closing the tree, I explicitly chose not to spread the disruption to the entire Mozilla community by cancelling nightlies. The load-vs-disruption balance didn't feel worth it.


Unknown what caused variable processing before midnight.
(In reply to Mike Taylor [:bear] from comment #38)
> [04:20]  <nagios-sjc1> dm-webtools02:Tinderbox Queue is OK: OK: 225 Messages
> Queued. Oldest is 5 minutes old
> 
> I opened the trees

bear: That's great. Thank you.

cshields: I'd like to leave this open for today to see how the disks handle a busy load. Between yesterday's 18-hour tree closure and developers now able to work on post-migration changes for the next train, I expect today to be a busy day.


Trees are currently open, so downgrading from blocker to normal.
Severity: blocker → normal
(In reply to John O'Duinn [:joduinn] from comment #41)
> cshields: I'd like to leave this open for today to see how the disks handle
> a busy load. Between yesterday's 18-hour tree closure and developers now able
> to work on post-migration changes for the next train, I expect today to be a
> busy day. 

Agreed..  Please poke zandr if things start to turn south.  I'm online now but didn't get much sleep on the plane - not sure how long I will last.
Note that for downtimes people normally don't close other trees besides Firefox and Try (and even then, I don't think Try was marked as closed, or the hooks did not work).

People kept on pushing frequently to other trees from noon until 5pm.

https://build.mozilla.org/buildapi/reports/pushes?starttime=1313521200&endtime=1313578800&int_size=3600
(In reply to John O'Duinn [:joduinn] from comment #40)
> (In reply to Amy Rich [:arich] from comment #39)
> > The reason that processing times varied was because tinderbox was constantly
> > getting more messages even though most of the trees were closed.  For
> > example, I took a quick look at the mail logs and we processed 4200 new
> > messages from midnight to 4:30PDT when the trees were reopened (in addition
> > to working on the backlog).
> 
> These messages (probably between ~1am and ~5am PDT) are for all the nightly
> builds across all branches. While we stopped all developers doing checkins
> for the day by closing the tree, I explicitly chose not to spread the
> disruption to the entire Mozilla community by cancelling nightlies. The
> load-vs-disruption balance didn't feel worth it.
> 
> 
> Unknown what caused variable processing before midnight.

We were getting messages the entire time; it wasn't just the nightly builds.

On Aug 15 we received a total of 46570 messages
On Aug 16 we received a total of 38689 messages

Here's the breakdown of new messages per hour on the 16th:

00:00 - 1605
01:00 - 1682
02:00 - 1441
03:00 - 1156
04:00 - 1991
05:00 - 2957
06:00 - 2483
07:00 - 2627
08:00 - 2777
09:00 - 2322
10:00 - 1888
11:00 - 1787
12:00 - 2210
13:00 - 1133
14:00 - 1134
15:00 - 1349
16:00 - 1198
17:00 - 1205
18:00 - 906
19:00 - 1070
20:00 - 1126
21:00 - 887
22:00 - 780
23:00 - 975
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #43)
> Note that for downtimes people normally don't close other trees besides
> Firefox and Try (and even then, I don't think Try was marked as closed, or
> the hooks did not work).

Usually the trees closed for a slow message queue are Firefox, Mozilla-Inbound, and Try. Try was closed.
(In reply to Marco Bonardo [:mak] from comment #45)
> (In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment
> #43)
> > Note that for downtimes people normally don't close other trees besides
> > Firefox and Try (and even then, I don't think Try was marked as closed,
> > or the hooks did not work).
> 
> Usually the trees closed for a slow message queue are Firefox,
> Mozilla-Inbound, and Try. Try was closed.

I know. I was just trying to point out that even if the Firefox and Try trees were closed there was activity going on in other trees and that explains (hopefully) why new messages were being received even though Firefox and Try were closed.
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #46)
> I know. I was just trying to point out that even if the Firefox and Try
> trees were closed there was activity going on in other trees and that
> explains (hopefully) why new messages were being received even though
> Firefox and Try were closed.

New messages from in-process pushes that started before the tree closure could account for some portion of that, especially if the build was all that was in progress when the trees closed.
Severity: normal → blocker
Sorry, don't know how I managed to make this a blocker again!
Severity: blocker → normal
These two bugs aren't strictly blockers, but are reducing mail load to tinderbox, so should help.
Depends on: 679448, 679589
22:30:04 zandr: so, did something large and anomalous just happen?
...
22:38:47 philor: is that a nice way of saying "philor got home and retriggered fifty failed jobs"?

:D

We had a spike this evening. The first alert I saw in #build was at 21:53:34, with 340 messages queued, 12 minutes old. The peak reported by nagios was at 22:23, at 602/17min. The check turned green for the last time at 00:05 with 95/6min, and by 01:15 had no backlog.

So, we really don't have any headroom at all here.

The good news is that the netops work tomorrow should leave us with fewer things to retrigger.
We seem to be back at the same level of performance we were seeing about a month ago, where the bottleneck is once again the single-threaded tinderbox process.  I think we can close this bug out if it was tracking the disk issues.
I agree with Amy, we should close this bug as FIXED.
Couldn't agree more.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations