performance on 10.253.0.11:/vol/aus2/ is really bad

RESOLVED FIXED

Status

--
major
RESOLVED FIXED
7 years ago
4 years ago

People

(Reporter: bhearsum, Assigned: dparsons)

Details

The backupsnip job for 3.6.24 took 77 minutes. The backupsnip job for 3.6.25, which was only 2-3% larger, took 256 minutes. This is painful because it hurts our turnaround time on releases, and sometimes even delays the shipping of chemspills.

Comment 1

7 years ago
This won't be resolved on the hardware side until we move to SCL3.  There's new, faster storage sitting there.

Sorry, sucks I know.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → WONTFIX
Firefox-9.0-build1 release channels backed up in:
real 5021.03
user 11.87
sys 58.92

Firefox-9.0-build1 test channels backed up in:
real 4656.95
user 11.02
sys 54.27
Firefox 3.6.25 release snippets:
real 2017.79
user 5.55
sys 34.32
Dan, can you fill in some background on this?  What might cause a significant difference in timings like this?  Could those be predicted or even estimated based on current load?  Snippets are a lot of tiny files, if that narrows the search.

We understand mrz's comments above that this won't get speedy.  But if releng can predict "snippets will be ready in 4.5 hours" that's lots better than the release-drivers expecting about an hour and being disappointed.

Rail, Ben, can you give timing info on 9.0 jobs?
(mid-aired, sorry)

I'm confused about comment 3 - that's about 34 minutes, not 256.  I assume that was two different jobs -- did they differ at all?
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
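For reference, the timing figures pasted in this thread are `real`/`user`/`sys` values in seconds, which is the shape of POSIX `time -p` output (whether the jobs actually use `time -p` is an assumption). A minimal Python sketch of the conversion behind the "about 34 minutes" figure; `parse_time_file` and `to_minutes` are hypothetical helper names, not part of any releng tooling:

```python
def parse_time_file(text):
    """Parse 'real/user/sys <seconds>' lines into a dict of floats."""
    timings = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("real", "user", "sys"):
            timings[parts[0]] = float(parts[1])
    return timings


def to_minutes(seconds):
    """Convert seconds to minutes, rounded to one decimal place."""
    return round(seconds / 60.0, 1)


# Values copied from the 3.6.25 release snippets timing quoted earlier in this thread.
sample = "real 2017.79\nuser 5.55\nsys 34.32\n"
timings = parse_time_file(sample)
print(to_minutes(timings["real"]))
```

In that sample, `user` plus `sys` add up to roughly 40 seconds out of about 2018, so nearly all of the wall-clock time is spent waiting rather than computing.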
From Ben:

comment #1 was the *backupsnip* for 3.6.25 release channel, my latest comment was the pushsnip

backupsnip tars up a set of snippets onto the same partition; pushsnip copies snippets from one place to another on that partition
Assignee: server-ops → dparsons
From today's RelEng/IT meeting:

(In reply to matthew zeier [:mrz] from comment #1)
> This won't be resolved on the hardware side until we move to SCL3.  There's
> new, faster storage sitting there.
> 
> Sorry, sucks I know.
Yep, understood. 


(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Dan, can you fill in some background on this?  What might cause a
> significant difference in timings like this?  Could those be predicted or
> even estimated based on current load?  Snippets are a lot of tiny files, if
> that narrows the search.
> 
> We understand mrz's comments above that this won't get speedy.  But if
> releng can predict "snippets will be ready in 4.5 hours" that's lots better
> than the release-drivers expecting about an hour and being disappointed.
> 
> Rail, Ben, can you give timing info on 9.0 jobs?

Please have the Nagios alerts for load/performance on this machine shared with RelEng in #buildduty. This would help RelEng make eyeball predictions during a time-critical release, if no better options are available.
Backupsnip for 9.0.1 test snippets started around Wed Dec 21 3am took 1 hrs, 45 mins, 56 secs.
$ cat /opt/aus2/snippets/backup/20111221-7-pre-Firefox-9.0.1-build1.time
real 6862.11
user 11.48
sys 58.54

Started around 9:45am PT today
pushsnip for 9.0.1:
-bash-3.2$ cat Firefox-9.0.1-build1-with-8.0-partials-all-platforms.time
real 691.51
user 2.37
sys 13.29

started around 5pm today.
Backupsnip for 10.0b1 test snippets started Thu Dec 22 08:49:52 2011 and took 1 hrs, 20 mins, 21 secs
From what I can tell, the remainder of the jobs after comment 0 are reasonably close to the "normal" 80 minutes.  Please keep the data coming, and include a timestamp for the job start so that we can correlate any long runs with logs on the netapps.

There's nothing more we can do until we know what the problem is.
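One way to capture the requested start timestamp alongside the duration is a small wrapper around the job. This is only a sketch under stated assumptions: `run_timed` and `format_record` are hypothetical names, the `start`/`exit` lines are an invented extension of the existing `real` line, and the command below is a stand-in rather than the real backupsnip invocation.

```python
import subprocess
import sys
import time
from datetime import datetime, timezone


def run_timed(cmd):
    """Run cmd; return (UTC start timestamp, elapsed seconds, exit code)."""
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    t0 = time.monotonic()
    result = subprocess.run(cmd)
    elapsed = time.monotonic() - t0
    return started, elapsed, result.returncode


def format_record(started, elapsed, returncode):
    """Render one record: start time, wall-clock seconds, exit status."""
    return "start %s\nreal %.2f\nexit %d\n" % (started, elapsed, returncode)


# Stand-in command (assumption); a real run would pass the job's argv instead.
started, elapsed, code = run_timed([sys.executable, "-c", "pass"])
print(format_record(started, elapsed, code), end="")
```

Recording the start in UTC avoids the PT/PST ambiguity when lining runs up against storage-side logs.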
Firefox-10.0b2-build1 backupsnip took:

real    75m58.086s
user    0m11.758s
sys     0m57.745s

date when it was finished:
Thu Dec 29 11:25:03 PST 2011
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Dan, can you fill in some background on this?  What might cause a
> significant difference in timings like this?  Could those be predicted or
> even estimated based on current load?  Snippets are a lot of tiny files, if
> that narrows the search.
> 
> We understand mrz's comments above that this won't get speedy.  But if
> releng can predict "snippets will be ready in 4.5 hours" that's lots better
> than the release-drivers expecting about an hour and being disappointed.
> 
:lerxst: ping?
(Assignee)

Comment 15

7 years ago
I spoke to :dustin on IRC about this before the holidays. From what I'm reading here, ~80 minutes is a normal amount of time?
John, the following was the outcome of our meeting on the topic, where we discussed exactly what Dan said in comment 15:

(quoting Dustin J. Mitchell [:dustin] from comment #12)
> From what I can tell, the remainder of the jobs after comment 0 are
> reasonably close to the "normal" 80 minutes.  Please keep the data coming,
> and include a timestamp for the job start so that we can correlate any long
> runs with logs on the netapps.
> 
> There's nothing more we can do until we know what the problem is.
From IRC, here's how to get at the files with the times in them:
15:00 < bhearsum> there's a bunch of .time files in /opt/aus2/snippets/staging and /opt/aus2/snippets/backup - accessible through dm-ausstage01
Results for 10.0b5:

Ran around Wed Jan 18 18:30

backupsnip ( 1 hrs, 17 mins, 23 secs )
pushsnip ( 1 hrs, 47 mins, 9 secs )
(In reply to Rail Aliiev [:rail] from comment #18)
> Results for 10.0b5:
> 
> Ran around Wed Jan 18 18:30
> 
> backupsnip ( 1 hrs, 17 mins, 23 secs )
> pushsnip ( 1 hrs, 47 mins, 9 secs )

The backupsnip I ran today took longer:

bash-3.2$ time ~/bin/backupsnip Firefox-10.0b5-build1
real    122m3.855s
user    0m13.355s
sys     1m11.133s

Midpoint was about 11:00am PST.
:lerxst, can you investigate comment 19?  It's not quite 256 minutes, but is a good bit longer than usual.
(Assignee)

Comment 21

7 years ago
Basically, all bets are off until we get on the new NetApp in scl3. Various things are hitting the sjc netapps harder lately and there's really not much I can do about it :(
OK, given that info, I think we'll close this.  Let's see how build storage works in scl3 -- hopefully a lot faster!
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → INCOMPLETE

Comment 23

7 years ago
We hit this again for 3.6.26:
https://wiki.mozilla.org/Releases/Firefox_3.6.26/BuildNotes#Push_to_Release_Channel

It was 199 mins, compared to 256 mins for 3.6.25 and 77 mins for 3.6.24.
I have to reopen the bug because our push times increased drastically these days. For example pushsnip of Firefox 3.6.26 build2 (last Sunday) gave up after 10 hrs, 5 mins, 31 secs. Pushsnip of 11.0b1 took 3 hrs, 45 mins, 35 secs (tonight).
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Comment #21 is still valid, unfortunately.
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → INCOMPLETE

Comment 26

7 years ago
Development is also being affected.
Could we please investigate if there is anything we can do in the meantime?

philor: log not available is what usually happens when the load on surf is over 125

The trees are currently CLOSED.

If I understand this incorrectly please let me know and we can file a separate bug for the tree closures.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
(In reply to Rail Aliiev [:rail] from comment #24)
> I have to reopen the bug because our push times increased drastically these
> days. For example pushsnip of Firefox 3.6.26 build2 (last Sunday) gave up
> after 10 hrs, 5 mins, 31 secs. 

This number can't be compared directly to others here, because it's made up of:
 * try to do the backupsnip
 * time out after 2 hours and retry
 * repeat 5 times
It's certainly true that performance must have been degraded for the job to hit the timeout in the first place.
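The arithmetic behind that can be sketched as follows. This is an illustration only: the real retry mechanism isn't shown in this bug, and `run_with_retries` and `always_times_out` are hypothetical names.

```python
def run_with_retries(attempt, max_attempts=5, timeout_s=2 * 60 * 60):
    """Call attempt(timeout_s) up to max_attempts times.

    attempt returns the seconds it ran on success, or raises TimeoutError
    once timeout_s has elapsed. Returns (succeeded, total_seconds_spent).
    """
    total = 0.0
    for _ in range(max_attempts):
        try:
            total += attempt(timeout_s)
            return True, total
        except TimeoutError:
            # A timed-out attempt costs the full timeout before the retry.
            total += timeout_s
    return False, total


def always_times_out(timeout_s):
    # Simulates storage so slow that the job never finishes in time.
    raise TimeoutError


ok, total = run_with_retries(always_times_out)
print(ok, total / 3600)
```

Five timed-out attempts at 2 hours each account for 10 hours of the reported 10 hrs, 5 mins, 31 secs, so the figure reflects the retry cap rather than the duration of any single run.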
(In reply to Dan Parsons [:lerxst] from comment #21)
> Basically, all bets are off until we get on the new NetApp in scl3. Various
> things are hitting the sjc netapps harder lately and there's really not much
> I can do about it :(
1) Can any of the things causing load by hitting the sjc netapps be deferred on release days if we give enough notice?

2) Can you set up alerts to RelEng when the sjc netapps are being overworked? It would be really helpful on release days if RelEng can at least warn people in advance that "disks running slow, the release is delayed a few hours, please plan accordingly".
(Assignee)

Comment 29

7 years ago
(In reply to John O'Duinn [:joduinn] from comment #28)
> (In reply to Dan Parsons [:lerxst] from comment #21)
> > Basically, all bets are off until we get on the new NetApp in scl3. Various
> > things are hitting the sjc netapps harder lately and there's really not much
> > I can do about it :(
> 1) Can any of the things causing load by hitting the sjc netapps be deferred
> on release days if we give enough notice?

No, not that I'm aware of.

> 
> 2) Can you setup alerts to RelEng when the sjc netapps are being overworked?
> It would be really helpful on release days if RelEng can at least warn
> people in advance that "disks running slow, the release is delayed a few
> hours, please plan accordingly".

The alerts go off on a daily basis. You wouldn't get much useful information by receiving them beyond "the sjc1 netapps are constantly overloaded".
(In reply to matthew zeier [:mrz] from comment #1)
> This won't be resolved on the hardware side until we move to SCL3.  There's
> new, faster storage sitting there.
> 
> Sorry, sucks I know.

Can we move releases to a different volume/server for the time being? Do we have any other options?

This is making it very difficult to plan releases, which exacerbates the difficulties associated with such a cross-functional task.
> Can we move releases to a different volume/server for the time being? Do we
> have any other options?

That's the essence of the issue - there isn't any other storage device.

> This is making it very difficult to plan releases, which exacerbates the
> difficulties associated with such a cross-functional task.

I get that and apologize.  We made plans to upgrade/replace storage last spring and have had hardware sitting idle in Santa Clara because we were massively delayed in spinning up a new data center.  The existing facility is out of power.
(In reply to Alex Keybl [:akeybl] from comment #30)
> Can we move releases to a different volume/server for the time being? Do we
> have any other options?

Bug 725838 does just that.  Given that bug 725838 and bug 683446 are the short-term plans, and bug scl3-move is the long-term plan, there's no more action here.
And apparently bugzilla doesn't recognize aliases in text - scl3-move is bug 716047.
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
(In reply to Dustin J. Mitchell [:dustin] from comment #32)
> Bug 725838 does just that.  Given that bug 725838 and bug 683446 are the
> short-term plans, and bug scl3-move is the long-term plan, there's no more
> action here.

Specifically we're moving the candidate bits off of the very busy 10.253.0.11 (mpt-netapp-b), and will only have to interact with that when we copy into releases/.
Product: mozilla.org → mozilla.org Graveyard