aus2-staging running slow, causing rel-automation timeout

Status: RESOLVED FIXED (blocker)

People: Reporter: joduinn; Assigned: phong

Attachments: (1 attachment)
As part of release-automation, we run "backupsnip" on aus2-staging. This typically takes 45-55 minutes to complete. However, today this command started at Wed Oct 7 14:30:34 2009 PDT and timed out after 2 hours. backupsnip copies very many small files from 10.253.0.11:/vol/aus2/, which justdave says is netapp-b. It would be worth checking the load on that device.
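For context, the automation step amounts to roughly the following (a sketch only: the user, script path, and the way the 2-hour limit is enforced are assumptions, shown here with coreutils timeout; in reality buildbot enforces its own timeout):

```shell
# Hypothetical sketch of what the automation does: run backupsnip on
# aus2-staging over ssh, giving up after 2 hours. The user and script
# path are assumptions, not the real invocation.
timeout 7200 ssh cltbld@aus2-staging '~/bin/backupsnip' \
  || echo "backupsnip failed or exceeded the 2-hour limit" >&2
```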

Setting blocker severity as this is impacting the FF3.5.4 release in progress.
Aravind is the most-likely to actually know what's going on, he's away from a computer at the moment, but should be on in 20 to 30 minutes.
Assignee: server-ops → aravind
(In reply to comment #1)
> Aravind is the most-likely to actually know what's going on, he's away from a
> computer at the moment, but should be on in 20 to 30 minutes.

ok, thanks justdave. coop has been on the hotseat all day, so he's taking this time to go let off some ultimate-steam and get some food.
18:41: aravind, nthomas working on this in irc.
I started up the actual backupsnip command we need to run in a screen session right on aus2-staging. It's running under time(1), and I'll post the numbers here when it completes.
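Something along these lines (a sketch; the session name, log path, and backupsnip arguments are assumptions):

```shell
# Sketch: run backupsnip directly on aus2-staging, timed with time(1),
# inside a detached screen session so it survives ssh disconnects.
# Session name, script path, and log path are all assumptions.
screen -dmS backupsnip bash -c \
  'time ~/bin/backupsnip 2>&1 | tee ~/backupsnip.log'
```

Reattach later with `screen -r backupsnip` to read the time(1) numbers.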
(In reply to comment #4)
> I started up the actual backupsnip command we need to run in a screen session
> right on aus2-staging. It's running under time(1), and I'll post the numbers
> here when it completes.

Took just over one hour to complete, which is in line with backupsnip attempts for previous releases.
summary from irc:

1) backupsnip typically takes 55-60 mins. coop made several attempts today, each of which timed out after 2 hours. fwiw, this worked normally yesterday when lsblakk did the backupsnip for FF3.0.15.

2) coop tried 4 times, running from a few different slaves. Details in https://wiki.mozilla.org/Releases/Firefox_3.5.4/BuildNotes. It is unknown why this last attempt worked. 

3) mrz/aravind continuing to investigate.

4) Note: I marked this as blocker because it was blocking work on the FF3.5.4 release. We've now got past that blockage, and have announced the updates snippets to QA. However, this needs to remain blocker until we know this won't bite us again when we do backupsnip and push to beta later this week.
Running a few backupsnips in a row from the buildbot harness. Logs and timing info can be found here:

http://production-master02.build.mozilla.org:8010/builders/updates/builds/5
(In reply to comment #7)
> Running a few backupsnips in a row from the buildbot harness. Logs and timing
> info can be found here:

Urghh, make that:

http://production-master02.build.mozilla.org:8010/builders/updates/builds/6

Had a little issue with the date being off in the original command.
Problem *does* reproduce again today.

Start	Thu Oct 8 13:25:07 2009
End	Thu Oct 8 15:25:07 2009
...timed out at 2hr limit.
(In reply to comment #6)
> 4) Note: I marked this as blocker because it was blocking work on the FF3.5.4
> release. We've now got past that blockage, and have announced the updates
> snippets to QA. However, this needs to remain blocker until we know this won't
> bite us again when we do backupsnip and push to beta later this week.

rel drivers just confirmed we will be pushing FF3.5.4 to beta tomorrow, so we do need this fixed before 11am PDT tomorrow.
(In reply to comment #10)
> (In reply to comment #6)
> > 4) Note: I marked this as blocker because it was blocking work on the FF3.5.4
> > release. We've now got past that blockage, and have announced the updates
> > snippets to QA. However, this needs to remain blocker until we know this won't
> > bite us again when we do backupsnip and push to beta later this week.
> 
> rel drivers just confirmed we will be pushing FF3.5.4 to beta tomorrow, so we
> do need this fixed before 11am PDT tomorrow.

Obviously, we are trying out some workarounds in case this isn't resolved by 11am tomorrow. At this point, with slim chance of a "real" fix in time, we need one of these workarounds to come through for us in experiments tonight.
(In reply to comment #11) 
> Obviously, we are trying out some workarounds in case this isn't resolved by
> 11am tomorrow. At this point, with slim chance of a "real" fix in time, we need
> one of these workarounds to come through for us in experiments tonight.

Whoa, it's not as bad as all that. The backupsnip/pushnip commands we need to run tomorrow get run directly on aus2-staging anyway, so as long as those commands run to completion (even if they take longer), we're still fine. All our local runs have been fine time-wise so far, FWIW.

I want to figure out as much as anybody why spawning the command via buildbot is timing out, but buildbot isn't involved in the beta push tomorrow.
Reproduced the problem again running backupsnip (three consecutive runs; the third hit the 2-hour limit):


Start	Thu Oct 8 16:59:37 2009
End	Thu Oct 8 17:59:23 2009
Status  SUCCESS

Start   Thu Oct 8 17:59:23 2009
End	Thu Oct 8 18:58:58 2009
Status  SUCCESS

Start	Thu Oct 8 18:58:58 2009
End	Thu Oct 8 20:59:01 2009
Status  FAIL

(still no idea why this is happening intermittently like this.)
(In reply to comment #12)
> (In reply to comment #11) 
> > Obviously, we are trying out some workarounds in case this isn't resolved by
> > 11am tomorrow. At this point, with slim chance of a "real" fix in time, we need
> > one of these workarounds to come through for us in experiments tonight.
> 
> Whoa, it's not as bad as all that. The backupsnip/pushnip commands we need to
> run tomorrow get run directly on aus2-staging anyway, so as long as those
> commands run to completion (even if they take longer), we're still fine. 
Agreed.


> All our local runs have been fine time-wise so far, FWIW.
Oh, ok. Good to know that, coop, thanks. 

I only saw the failure from the URL in comment #8 (coop's run), and comment #13 (joduinn's run). Not that it matters now that we have a workaround for tomorrow. While this situation is not perfect, at least it feels like we can now muddle through.


> I want to figure out as much as anybody why spawning the command via buildbot
> is timing out, but buildbot isn't involved in the beta push tomorrow.

cool.
Is there anything here that needs IT's attention (specifically from a netapp perspective)?
(In reply to comment #15)
> Is there anything here that needs IT's attention (specifically from a netapp
> perspective)?

Welcome back aravind! 

Yep, this is still causing us problems. It's still unclear whether this is a VM/ESX host issue, a netapp issue, or something else. Can you and mrz chat to see how best to proceed here?

If it helps, see the earlier comments for the times when we hit problems again yesterday. Also, we can reproduce the problem easily if that helps you with debugging.
I think there's something more subtle going on here. As I write, the backupsnip for 3.6b1 build1 has been running for 1hr 50 min and will time out soon. The size of the tarball produced is plausible, and the modification time implies backupsnip took 63 minutes to complete (clock on aus2-staging is correct too). I'm unpacking the backup and diffing against the source to look for differences, but I suspect it's a completely valid backup.

Looking at the running processes on aus2-staging with 'ps auxf', I can't find a process tree of sshd -> bash -> backupsnip. Yet moz2-linux-slave20 (which is the slave that ssh'd in to run the backup) thinks the ssh process is still running. It's in state Ss+, so interruptible sleep, session leader, and foreground process group. Attaching strace to the process isn't any help; we'd have to attach earlier, I suspect.

Seems to me there is some sort of hang-up problem here.
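The checks described above look roughly like this (the grep pattern and ps options are illustrative, not the exact commands used):

```shell
# On aus2-staging: ps auxf draws the process tree; look for the expected
# sshd -> bash -> backupsnip chain (pattern is illustrative).
ps auxf | egrep '[s]shd|[b]ackupsnip'

# On moz2-linux-slave20: check the state of the lingering ssh client.
# STAT "Ss+" = interruptible sleep, session leader, foreground process.
ps -o pid,stat,args -C ssh
```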
Nick: if you end up doing it by hand, try creating the tarball in /tmp (as Aravind suggested last week) and then moving the created tarball into the correct dir. That will avoid a double-hit on the NFS partition.

Anticipating having to do this for 3.5.4 as well.
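A minimal sketch of that workaround; the snippet and backup paths here are illustrative stand-ins, not the real aus2-staging layout:

```shell
# Sketch of the /tmp workaround (paths are assumptions, not the real layout).
# Writing the tarball to local disk first means the NFS partition is only
# read once, instead of being read and written at the same time.
SNIPPET_DIR=/opt/aus2/snippets   # assumed source tree on the NFS mount
BACKUP_DIR=/opt/aus2/backups     # assumed destination on the NFS mount
NAME="backup-$(date +%Y%m%d).tar"

tar cf "/tmp/$NAME" -C "$SNIPPET_DIR" .   # archive to local /tmp
mv "/tmp/$NAME" "$BACKUP_DIR/$NAME"       # then one move onto NFS
```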
Passing over to phong to check vmware settings.
Assignee: aravind → phong
Created attachment 409665 [details] [diff] [review]
[checked in] Verbose tar

I tried making the tar command verbose as a way of keeping data flowing over the ssh link from aus2-staging to linux slave. A buildbot-driven backupsnip succeeded with this change, taking a bit longer at 75 minutes. I've only tried it once so this is certainly not well tested for reliability, but might be useful for 3.5.5.
Attachment #409665 - Flags: review?(bhearsum)
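In spirit, the change is just adding -v to the tar invocation; a sketch (the real backupsnip's exact flags and paths are not shown here):

```shell
# Sketch only: the real backupsnip flags and paths may differ.
# Before: tar is silent for the whole ~1h run, so the ssh link carries
# no output and a watching harness can mistake the step for hung.
tar cf backup.tar snippets/

# After: -v prints one line per file, keeping data flowing back over
# the ssh link from aus2-staging to the linux slave for the whole run.
tar cvf backup.tar snippets/
```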
Comment on attachment 409665 [details] [diff] [review]
[checked in] Verbose tar

This local change on aus2-staging panicked me good when I was trying to backupsnip for 3.6b1. But if it works, let's do it!
Attachment #409665 - Flags: review?(bhearsum) → review+
(In reply to comment #20)
> Created an attachment (id=409665) [details]
> Verbose tar
> 
> I tried making the tar command verbose as a way of keeping data flowing over
> the ssh link from aus2-staging to linux slave. A buildbot-driven backupsnip
> succeeded with this change, taking a bit longer at 75 minutes. I've only tried
> it once so this is certainly not well tested for reliability, but might be
> useful for 3.5.5.

Thanks Nick. Yep, more verbose is slightly slower, as you noted. However, it could be really useful for figuring out what's causing the timeout. (One theory I had was that the disks were fine and backupsnip completed successfully, but that the end of the process got missed by buildbot somehow. Your change should make this easy to spot next time it bites us.)
Comment on attachment 409665 [details] [diff] [review]
[checked in] Verbose tar

Checking in backupsnip;
/cvsroot/mozilla/tools/release/bin/backupsnip,v  <--  backupsnip
new revision: 1.2; previous revision: 1.1
done

aus2-staging:~cltbld/bin updated; staging-stage:~cltbld/bin set to pull from :ext:stgbld@cvs.mozilla.org:/cvsroot rather than the local CVS mirror, and updated. Sorry about that scare Ben, forgot to let you know.
Attachment #409665 - Attachment description: Verbose tar → [checked in] Verbose tar
[14:44]  <nthomas> we've worked around the problem
[14:54]  <phong> if that's the case, can we close that bug?
[14:59]  <nthomas> guh, that bug is so confusing
[15:00]  <nthomas> yeah, why don't you close it and we'll reopen if we hit it again
Status: NEW → RESOLVED
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard