Closed Bug 521139 Opened 15 years ago Closed 15 years ago
aus2-staging running slow, causing rel-automation timeout
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: joduinn, Assigned: phong)
References
Details
Attachments
(1 file)
[checked in] Verbose tar: patch, 755 bytes, bhearsum: review+
As part of release automation, we run "backupsnip" on aus2-staging. This typically takes 45-55 minutes to complete. However, today this command started at Wed Oct 7 14:30:34 2009 PDT and timed out after 2 hours. backupsnip copies a very large number of small files from 10.253.0.11:/vol/aus2/, which justdave says is netapp-b. It would be worth checking the load on that device.
Setting blocker severity as this is impacting the FF3.5.4 release in progress.
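For anyone reproducing this by hand, the step boils down to something of this shape, run from a build slave. This is a sketch, not the literal automation command; the release directory name is a hypothetical placeholder, and backupsnip lives in ~cltbld/bin on aus2-staging:

    # ssh in from a build slave and time the backup step;
    # "20091007-Firefox-3.5.4" is a made-up snippet directory name
    ssh cltbld@aus2-staging 'time ~/bin/backupsnip 20091007-Firefox-3.5.4'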
Comment 1•15 years ago
Aravind is the most-likely to actually know what's going on, he's away from a computer at the moment, but should be on in 20 to 30 minutes.
Assignee: server-ops → aravind
Reporter
Comment 2•15 years ago
(In reply to comment #1)
> Aravind is the most-likely to actually know what's going on, he's away from a
> computer at the moment, but should be on in 20 to 30 minutes.
ok, thanks justdave. coop has been on the hotseat all day, so he's taking this time to go let off some ultimate-steam and get some food.
Reporter
Comment 3•15 years ago
18:41: aravind and nthomas are working on this in IRC.
Comment 4•15 years ago
I started up the actual backupsnip command we need to run in a screen session right on aus2-staging. It's running time-d, and I'll post the numbers here when it completes.
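Roughly, the manual run looks like this (the session name and release directory are placeholders):

    # on aus2-staging: keep the run alive even if the ssh session drops
    screen -S backupsnip
    time ~/bin/backupsnip 20091007-Firefox-3.5.4
    # detach with Ctrl-a d; reattach later with: screen -r backupsnip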
Comment 5•15 years ago
(In reply to comment #4)
> I started up the actual backupsnip command we need to run in a screen session
> right on aus2-staging. It's running time-d, and I'll post the numbers here when
> it completes.
Took just over one hour to complete, which is in line with backupsnip attempts for previous releases.
Reporter
Comment 6•15 years ago
Summary from IRC:
1) backupsnip typically takes 55-60 minutes. coop made several attempts today, each of which failed after 2 hours. FWIW, this worked normally yesterday when lsblakk ran backupsnip for FF3.0.15.
2) coop tried 4 times, running from a few different slaves. Details are in https://wiki.mozilla.org/Releases/Firefox_3.5.4/BuildNotes. It is unknown why the last attempt worked.
3) mrz and aravind are continuing to investigate.
4) Note: I marked this as a blocker because it was blocking work on the FF3.5.4 release. We've now got past that blockage and have announced the update snippets to QA. However, this needs to remain a blocker until we know it won't bite us again when we do backupsnip and push to beta later this week.
Comment 7•15 years ago
Running a few backupsnips in a row from the buildbot harness. Logs and timing info can be found here:
http://production-master02.build.mozilla.org:8010/builders/updates/builds/5
Comment 8•15 years ago
(In reply to comment #7)
> Running a few backupsnips in a row from the buildbot harness. Logs and timing
> info can be found here:
Urghh, make that:
http://production-master02.build.mozilla.org:8010/builders/updates/builds/6
Had a little issue with the date being off in the original command.
Reporter
Comment 9•15 years ago
Problem *does* reproduce again today.
Start Thu Oct 8 13:25:07 2009
End Thu Oct 8 15:25:07 2009
...timed out at the 2-hour limit.
Reporter
Comment 10•15 years ago
(In reply to comment #6)
> 4) Note: I marked this as blocker because it was blocking work on the FF3.5.4
> release. We've now got past that blockage, and have announced the updates
> snippets to QA. However, this needs to remain blocker until we know this wont
> bite us again when we do backupsnip and push to beta later this week.
rel drivers just confirmed we will be pushing FF3.5.4 to beta tomorrow, so we do need this fixed before 11am PDT tomorrow.
Reporter
Comment 11•15 years ago
(In reply to comment #10)
> (In reply to comment #6)
> > 4) Note: I marked this as blocker because it was blocking work on the FF3.5.4
> > release. We've now got past that blockage, and have announced the updates
> > snippets to QA. However, this needs to remain blocker until we know this wont
> > bite us again when we do backupsnip and push to beta later this week.
>
> rel drivers just confirmed we will be pushing FF3.5.4 to beta tomorrow, so we
> do need this fixed before 11am PDT tomorrow.
Obviously, we are trying out some workarounds in case this isn't resolved by 11am tomorrow. At this point, with only a slim chance of a "real" fix in time, we need one of these workarounds to come through for us in tonight's experiments.
Comment 12•15 years ago
(In reply to comment #11)
> Obviously, we are trying out some workarounds in case this isnt resolved by
> 11am tmrw. At this point, with slim chance of a "real" fix in time, we need one
> of these workarounds to come through for us in experiments tonight.
Whoa, it's not as bad as all that. The backupsnip/pushsnip commands we need to run tomorrow get run directly on aus2-staging anyway, so as long as those commands run to completion (even if they take longer), we're still fine. All our local runs have been fine time-wise so far, FWIW.
I want to figure out as much as anybody why spawning the command via buildbot is timing out, but buildbot isn't involved in the beta push tomorrow.
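For tomorrow that means, roughly, running the commands in a shell directly on aus2-staging rather than via a buildbot step. The directory name and paths below are placeholders:

    time ~/bin/backupsnip 20091008-Firefox-3.5.4
    time ~/bin/pushsnip 20091008-Firefox-3.5.4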
Reporter
Comment 13•15 years ago
Reproduced the problem again running backupsnip three times back to back:
Run 1: start Thu Oct 8 16:59:37 2009, end Thu Oct 8 17:59:23 2009, status SUCCESS
Run 2: start Thu Oct 8 17:59:23 2009, end Thu Oct 8 18:58:58 2009, status SUCCESS
Run 3: start Thu Oct 8 18:58:58 2009, end Thu Oct 8 20:59:01 2009, status FAIL (hit the 2-hour limit)
(Still no idea why this is happening intermittently like this.)
Reporter
Comment 14•15 years ago
(In reply to comment #12)
> (In reply to comment #11)
> > Obviously, we are trying out some workarounds in case this isnt resolved by
> > 11am tmrw. At this point, with slim chance of a "real" fix in time, we need one
> > of these workarounds to come through for us in experiments tonight.
>
> Whoa, it's not as bad as all that. The backupsnip/pushnip commands we need to
> run tomorrow get run directly on aus2-staging anyway, so as long as those
> commands run to completion (even if they take longer), we're still fine.
Agreed.
> All our local runs have been fine time-wise so far, FWIW.
Oh, ok. Good to know that, coop, thanks.
I only saw the failures from the URL in comment #8 (coop's run) and comment #13 (joduinn's run). Not that it matters now that we have a workaround for tomorrow. While this situation is not perfect, at least it feels like we can muddle through tomorrow.
> I want to figure out as much as anybody why spawning the command via buildbot
> is timing out, but buildbot isn't involved in the beta push tomorrow.
cool.
Comment 15•15 years ago
Is there anything here that needs IT's attention (specifically from a netapp perspective)?
Reporter
Comment 16•15 years ago
(In reply to comment #15)
> Is there anything here that needs ITs attention (Specifically from netapp
> perspective)?
Welcome back, aravind!
Yep, this is still causing us problems. It's still unclear whether this is a VM/ESX host issue, a netapp issue, or something else. Can you and mrz chat to see how best to proceed here?
If it helps, see the earlier comments for the times when we hit problems again yesterday. Also, we can reproduce the problem easily if that helps you with debugging.
Comment 17•15 years ago
I think there's something more subtle going on here. As I write, the backupsnip for 3.6b1 build1 has been running for 1hr 50 min and will time out soon. The size of the tarball produced is plausible, and the modification time implies backupsnip took 63 minutes to complete (clock on aus2-staging is correct too). I'm unpacking the backup and diffing against the source to look for differences, but I suspect it's a completely valid backup.
Looking at the running processes on aus2-staging with 'ps auxf', I can't find a process tree of sshd -> bash -> backupsnip. Yet moz2-linux-slave20 (which is the slave that ssh'd in to run the backup) thinks the ssh process is still running. It's in state Ss+, so interruptible sleep, session leader, and foreground process group. Attaching strace to the process isn't any help; I suspect we'd have to attach earlier.
Seems to me there is some sort of hang-up problem here.
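For the record, the checks above were of this shape (the PID is a placeholder, and as noted, strace would need to be attached before the command goes quiet to tell us much):

    # on aus2-staging: look for the expected sshd -> bash -> backupsnip tree
    ps auxf | grep -B1 -A2 backupsnip
    # on the slave (moz2-linux-slave20): state of the outbound ssh process
    ps -o pid,stat,etime,cmd -C ssh
    # attach to the apparently hung ssh on the slave (needs root)
    strace -f -p <PID>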
Comment 18•15 years ago
Nick: if you end up doing it by hand, try creating the tarball in /tmp (as Aravind suggested last week) and then moving the created tarball into the correct dir. That will avoid a double-hit on the NFS partition.
Anticipating having to do this for 3.5.4 as well.
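A rough sketch of that workaround (the file names and destination directory are placeholders; the point is just the shape of it):

    # write the tarball to local disk first, so the NFS volume is only read,
    # not read and written at the same time
    tar czf /tmp/20091008-Firefox-3.5.4.tar.gz 20091008-Firefox-3.5.4
    # then move the finished tarball into the usual backup dir on the NFS mount
    mv /tmp/20091008-Firefox-3.5.4.tar.gz /path/to/backup/dir/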
Comment 20•15 years ago
I tried making the tar command verbose as a way of keeping data flowing over the ssh link from aus2-staging to linux slave. A buildbot-driven backupsnip succeeded with this change, taking a bit longer at 75 minutes. I've only tried it once so this is certainly not well tested for reliability, but might be useful for 3.5.5.
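The attached patch isn't reproduced here, but the idea amounts to adding -v to the tar call inside backupsnip, along these lines (the variable names are placeholders, not the script's actual ones):

    # before: tar is silent for the best part of an hour, so nothing crosses
    # the ssh link while it runs
    tar czf "$backup_tarball" "$release_dir"
    # after: one line of output per file keeps data trickling back to the slave,
    # which should keep any no-output watchdog on the calling side happy
    tar cvzf "$backup_tarball" "$release_dir"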
Attachment #409665 - Flags: review?(bhearsum)
Comment 21•15 years ago
Comment on attachment 409665 [details] [diff] [review]
[checked in] Verbose tar
This local change on aus2-staging panicked me good when I was trying to backupsnip for 3.6b1. But if it works, let's do it!
Attachment #409665 - Flags: review?(bhearsum) → review+
Reporter
Comment 22•15 years ago
(In reply to comment #20)
> Created an attachment (id=409665) [details]
> Verbose tar
>
> I tried making the tar command verbose as a way of keeping data flowing over
> the ssh link from aus2-staging to linux slave. A buildbot-driven backupsnip
> succeeded with this change, taking a bit longer at 75 minutes. I've only tried
> it once so this is certainly not well tested for reliability, but might be
> useful for 3.5.5.
Thanks Nick. Yep, more verbose is slightly slower, as you noted. However, it could be really useful for figuring out what's causing the timeout. (One theory I had was that the disks were fine and backupsnip completed successfully, but that the end of the process got missed by buildbot somehow. Your change should make this easy to spot the next time it bites us.)
Comment 23•15 years ago
Comment on attachment 409665 [details] [diff] [review]
[checked in] Verbose tar
Checking in backupsnip;
/cvsroot/mozilla/tools/release/bin/backupsnip,v <-- backupsnip
new revision: 1.2; previous revision: 1.1
done
aus2-staging:~cltbld/bin updated; staging-stage:~cltbld/bin set to pull from :ext:stgbld@cvs.mozilla.org:/cvsroot rather than the local CVS mirror, and updated. Sorry about that scare, Ben; I forgot to let you know.
Attachment #409665 - Attachment description: Verbose tar → [checked in] Verbose tar
Assignee
Comment 24•15 years ago
[14:44] <nthomas> we've worked around the problem
[14:54] <phong> if that's the case, can we close that bug?
[14:59] <nthomas> guh, that bug is so confusing
[15:00] <nthomas> yeah, why don't you close it and we'll reopen if we hit it again
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard