Closed Bug 711176 Opened 13 years ago Closed 13 years ago

performance on 10.253.0.11:/vol/aus2/ is really bad

Categories: mozilla.org Graveyard :: Server Operations (task)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: bhearsum; Assigned: dparsons
The backupsnip job for 3.6.24 took 77 minutes. The backupsnip job for 3.6.25, which was only 2-3% larger, took 256 minutes. This is painful because it hurts our turnaround time on releases, and sometimes even delays the shipping of chemspills.
Comment 1•13 years ago
This won't be resolved on the hardware side until we move to SCL3. There's new, faster storage sitting there.
Sorry, sucks I know.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX
Comment 2•13 years ago
Firefox-9.0-build1 release channels backed up in:
real 5021.03
user 11.87
sys 58.92
Firefox-9.0-build1 test channels backed up in:
real 4656.95
user 11.02
sys 54.27
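(For reference: 5021.03 s of wall time is roughly 84 minutes, and 4656.95 s roughly 78 minutes.)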
Reporter
Comment 3•13 years ago
Firefox 3.6.25 release snippets:
real 2017.79
user 5.55
sys 34.32
Comment 4•13 years ago
Dan, can you fill in some background on this? What might cause a significant difference in timings like this? Could those be predicted or even estimated based on current load? Snippets are a lot of tiny files, if that narrows the search.
We understand mrz's comments above that this won't get speedy. But if releng can predict "snippets will be ready in 4.5 hours" that's lots better than the release-drivers expecting about an hour and being disappointed.
Rail, Ben, can you give timing info on 9.0 jobs?
Comment 5•13 years ago
(mid-aired, sorry)
I'm confused about comment 3 - that's about 34 minutes, not 256. I assume that was two different jobs -- did they differ at all?
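(Converting comment 3's wall time: 2017.79 s ÷ 60 ≈ 33.6 minutes.)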
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Comment 6•13 years ago
From Ben:
comment #1 was the *backupsnip* for the 3.6.25 release channel; my latest comment was the pushsnip
backupsnip tars up a set of snippets onto the same partition; pushsnip copies snippets from one place to another on that partition
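To make the two jobs concrete, they amount to roughly the following (a sketch only, not the actual scripts; the pushsnip destination $LIVE_DIR is a placeholder):

# Sketch of the two jobs described above -- not the real backupsnip/pushsnip.
# backupsnip: tar a release's snippets into a backup on the SAME partition,
# writing POSIX "real/user/sys" timing into a .time file.
/usr/bin/time -p tar cf /opt/aus2/snippets/backup/$RELEASE.tar \
    /opt/aus2/snippets/staging/$RELEASE \
    2> /opt/aus2/snippets/backup/$RELEASE.time

# pushsnip: copy the staged snippets to the live tree on that same partition.
/usr/bin/time -p cp -a /opt/aus2/snippets/staging/$RELEASE $LIVE_DIR \
    2> /opt/aus2/snippets/staging/$RELEASE.time

Both jobs walk many thousands of tiny files, so their runtime is dominated by per-file NFS operations rather than raw throughput, which is why load on the netapp matters so much.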
Assignee: server-ops → dparsons
Comment 7•13 years ago
From today's RelEng/IT meeting,
(In reply to matthew zeier [:mrz] from comment #1)
> This won't be resolved on the hardware side until we move to SCL3. There's
> new, faster storage sitting there.
>
> Sorry, sucks I know.
Yep, understood.
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Dan, can you fill in some background on this? What might cause a
> significant difference in timings like this? Could those be predicted or
> even estimated based on current load? Snippets are a lot of tiny files, if
> that narrows the search.
>
> We understand mrz's comments above that this won't get speedy. But if
> releng can predict "snippets will be ready in 4.5 hours" that's lots better
> than the release-drivers expecting about an hour and being disappointed.
>
> Rail, Ben, can you give timing info on 9.0 jobs?
Please have the Nagios load/performance alerts for this machine shared with RelEng in #buildduty. That would help RelEng eyeball a prediction during a time-critical release, if no better options are available.
Comment 8•13 years ago
Backupsnip for 9.0.1 test snippets started around Wed Dec 21 3am and took 1 hrs, 45 mins, 56 secs.
Comment 9•13 years ago
$ cat /opt/aus2/snippets/backup/20111221-7-pre-Firefox-9.0.1-build1.time
real 6862.11
user 11.48
sys 58.54
Started around 9:45am PT today
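(6862.11 s is about 114 minutes.)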
Reporter
Comment 10•13 years ago
pushsnip for 9.0.1:
-bash-3.2$ cat Firefox-9.0.1-build1-with-8.0-partials-all-platforms.time
real 691.51
user 2.37
sys 13.29
Started around 5pm today.
Comment 11•13 years ago
Backupsnip for 10.0b1 test snippets started Thu Dec 22 08:49:52 2011 and took 1 hrs, 20 mins, 21 secs
Comment 12•13 years ago
From what I can tell, the remainder of the jobs after comment 0 are reasonably close to the "normal" 80 minutes. Please keep the data coming, and include a timestamp for the job start so that we can correlate any long runs with logs on the netapps.
There's nothing more we can do until we know what the problem is.
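One lightweight way to capture both numbers at once (a sketch; the real wrappers already produce the .time files, and the path and name here are placeholders):

# Sketch: record the start timestamp, then append the POSIX timing output.
TIMEFILE=/opt/aus2/snippets/backup/$RELEASE.time
date "+started: %a %b %d %T %Z %Y" > $TIMEFILE
/usr/bin/time -p ~/bin/backupsnip $RELEASE 2>> $TIMEFILE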
Comment 13•13 years ago
Firefox-10.0b2-build1 backupsnip took:
real 75m58.086s
user 0m11.758s
sys 0m57.745s
date when it was finished:
Thu Dec 29 11:25:03 PST 2011
Comment 14•13 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Dan, can you fill in some background on this? What might cause a
> significant difference in timings like this? Could those be predicted or
> even estimated based on current load? Snippets are a lot of tiny files, if
> that narrows the search.
>
> We understand mrz's comments above that this won't get speedy. But if
> releng can predict "snippets will be ready in 4.5 hours" that's lots better
> than the release-drivers expecting about an hour and being disappointed.
>
:lerxst: ping?
Assignee
Comment 15•13 years ago
I spoke to :dustin on IRC about this before the holidays. From what I'm reading here, ~80 minutes is a normal amount of time?
Comment 16•13 years ago
John, the following was the outcome of our meeting on the topic, where we discussed exactly what Dan said in comment 15:
(quoting Dustin J. Mitchell [:dustin] from comment #12)
> From what I can tell, the remainder of the jobs after comment 0 are
> reasonably close to the "normal" 80 minutes. Please keep the data coming,
> and include a timestamp for the job start so that we can correlate any long
> runs with logs on the netapps.
>
> There's nothing more we can do until we know what the problem is.
Reporter
Comment 17•13 years ago
From IRC, here's how to get at the files with the times in them:
15:00 < bhearsum> there's a bunch of .time files in /opt/aus2/snippets/staging and /opt/aus2/snippets/backup - accessible through dm-ausstage01
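A quick way to summarize those files, assuming the POSIX "real <seconds>" format shown in earlier comments:

-bash-3.2$ grep -H '^real' /opt/aus2/snippets/staging/*.time \
      /opt/aus2/snippets/backup/*.time \
      | awk -F'[: ]+' '{printf "%-70s %7.1f min\n", $1, $3/60}'

Each output line pairs a .time file with its wall time in minutes, which makes the slow runs easy to spot.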
Comment 18•13 years ago
Results for 10.0b5:
Ran around Wed Jan 18 18:30
backupsnip ( 1 hrs, 17 mins, 23 secs )
pushsnip ( 1 hrs, 47 mins, 9 secs )
Comment 19•13 years ago
(In reply to Rail Aliiev [:rail] from comment #18)
> Results for 10.0b5:
>
> Ran around Wed Jan 18 18:30
>
> backupsnip ( 1 hrs, 17 mins, 23 secs )
> pushsnip ( 1 hrs, 47 mins, 9 secs )
The backupsnip I ran today took longer:
bash-3.2$ time ~/bin/backupsnip Firefox-10.0b5-build1
real 122m3.855s
user 0m13.355s
sys 1m11.133s
Midpoint was about 11:00am PST.
Comment 20•13 years ago
:lerxst, can you investigate comment 19? It's not quite 256 minutes, but is a good bit longer than usual.
Assignee
Comment 21•13 years ago
Basically, all bets are off until we get on the new NetApp in scl3. Various things are hitting the sjc netapps harder lately and there's really not much I can do about it :(
Comment 22•13 years ago
OK, given that info, I think we'll close this. Let's see how build storage works in scl3 -- hopefully a lot faster!
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → INCOMPLETE
Comment 23•13 years ago
We hit this again for 3.6.26:
https://wiki.mozilla.org/Releases/Firefox_3.6.26/BuildNotes#Push_to_Release_Channel
It took 199 mins, compared to the 256 mins for 3.6.25 and the 77 mins for 3.6.24.
Comment 24•13 years ago
I have to reopen this bug because our push times have increased drastically recently. For example, pushsnip of Firefox 3.6.26 build2 (last Sunday) gave up after 10 hrs, 5 mins, 31 secs. Pushsnip of 11.0b1 (tonight) took 3 hrs, 45 mins, 35 secs.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Comment 25•13 years ago
Comment #21 is still valid, unfortunately.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → INCOMPLETE
Comment 26•13 years ago
Development is also being affected.
Could we please investigate whether there is anything we can do in the meantime?
philor: log not available is what usually happens when the load on surf is over 125
The trees are currently CLOSED.
If I understand this incorrectly, please let me know and we can file a separate bug for the tree closures.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Comment 27•13 years ago
(In reply to Rail Aliiev [:rail] from comment #24)
> I have to reopen the bug because our push times increased drastically these
> days. For example pushsnip of Firefox 3.6.26 build2 (last Sunday) gave up
> after 10 hrs, 5 mins, 31 secs.
This number can't be compared directly to others here, because it's made up of
* try to do the backupsnip
* timeout after 2 hours and retry
* repeat 5 times
It's certainly true that performance was degraded to hit the timeout in the first place.
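In other words, the 10-hour figure comes from behaviour roughly like this (a sketch of the retry loop described above, not the actual wrapper):

# Sketch of the timeout-and-retry behaviour -- not the real script.
for attempt in 1 2 3 4 5; do
    if timeout 7200 ~/bin/backupsnip "$RELEASE"; then
        break    # success, stop retrying
    fi
    echo "attempt $attempt hit the 2-hour timeout, retrying" >&2
done
# 5 attempts x 2 hours each accounts for the ~10 hours in comment 24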
Comment 28•13 years ago
(In reply to Dan Parsons [:lerxst] from comment #21)
> Basically, all bets are off until we get on the new NetApp in scl3. Various
> things are hitting the sjc netapps harder lately and there's really not much
> I can do about it :(
1) Can any of the things causing load by hitting the sjc netapps be deferred on release days if we give enough notice?
2) Can you set up alerts to RelEng when the sjc netapps are being overworked? It would be really helpful on release days if RelEng can at least warn people in advance that "disks running slow, the release is delayed a few hours, please plan accordingly".
Assignee
Comment 29•13 years ago
(In reply to John O'Duinn [:joduinn] from comment #28)
> (In reply to Dan Parsons [:lerxst] from comment #21)
> > Basically, all bets are off until we get on the new NetApp in scl3. Various
> > things are hitting the sjc netapps harder lately and there's really not much
> > I can do about it :(
> 1) Can any of the things causing load by hitting the sjc netapps be deferred
> on release days if we give enough notice?
No, not that I'm aware of.
>
> 2) Can you set up alerts to RelEng when the sjc netapps are being overworked?
> It would be really helpful on release days if RelEng can at least warn
> people in advance that "disks running slow, the release is delayed a few
> hours, please plan accordingly".
The alerts go off on a daily basis. You wouldn't get much useful information by receiving them beyond "the sjc1 netapps are constantly overloaded".
Comment 30•13 years ago
(In reply to matthew zeier [:mrz] from comment #1)
> This won't be resolved on the hardware side until we move to SCL3. There's
> new, faster storage sitting there.
>
> Sorry, sucks I know.
Can we move releases to a different volume/server for the time being? Do we have any other options?
This is making it very difficult to plan releases, which exacerbates the difficulties associated with such a cross-functional task.
Comment 31•13 years ago
> Can we move releases to a different volume/server for the time being? Do we
> have any other options?
That's the essence of the issue - there isn't any other storage device.
> This is making it very difficult to plan releases, which exacerbates the
> difficulties associated with such a cross-functional task.
I get that and apologize. We made plans to upgrade/replace storage last spring and have had hardware sitting idle in Santa Clara because we were massively delayed in spinning up a new data center. The existing facility is out of power.
Comment 32•13 years ago
(In reply to Alex Keybl [:akeybl] from comment #30)
> Can we move releases to a different volume/server for the time being? Do we
> have any other options?
Bug 725838 does just that. Given that bug 725838 and bug 683446 are the short-term plans, and bug scl3-move is the long-term plan, there's no more action here.
Comment 33•13 years ago
And apparently bugzilla doesn't recognize aliases in text - scl3-move is bug 716047.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Comment 34•13 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #32)
> Bug 725838 does just that. Given that bug 725838 and bug 683446 are the
> short-term plans, and bug scl3-move is the long-term plan, there's no more
> action here.
Specifically we're moving the candidate bits off of the very busy 10.253.0.11 (mpt-netapp-b), and will only have to interact with that when we copy into releases/.
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard