Closed Bug 521139 Opened 15 years ago Closed 15 years ago

aus2-staging running slow, causing rel-automation timeout

Categories: mozilla.org Graveyard :: Server Operations
Type: task
Priority: Not set
Severity: blocker
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: joduinn, Assigned: phong)
Attachments: (1 file)

As part of release automation, we run "backupsnip" on aus2-staging. This typically takes 45-55 minutes to complete. However, today this command started at Wed Oct 7 14:30:34 2009 PDT and timed out after 2 hours. backupsnip copies many, many small files from 10.253.0.11:/vol/aus2/, which justdave says is netapp-b. It would be worth checking the load on that device.

Setting blocker severity as this is impacting the FF3.5.4 release in progress.
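For reference, here is the kind of client-side check that might show whether the NFS mount from netapp-b is the bottleneck (the mount point and which of these tools are actually installed on aus2-staging are guesses on my part):

    # on aus2-staging: client RPC counters (retransmits climbing would point at the netapp/network)
    nfsstat -c
    # per-mount NFS throughput/latency, if nfsiostat from nfs-utils is available
    nfsiostat 5 3
    # confirm which export actually backs the snippet tree
    mount | grep 10.253.0.11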
Aravind is the most likely to actually know what's going on; he's away from a computer at the moment, but should be on in 20 to 30 minutes.
Assignee: server-ops → aravind
(In reply to comment #1)
> Aravind is the most-likely to actually know what's going on, he's away from a
> computer at the moment, but should be on in 20 to 30 minutes.

ok, thanks justdave. coop has been on the hotseat all day, so he's taking this time to go let off some ultimate-steam and get some food.
18:41: aravind and nthomas are working on this in IRC.
I started up the actual backupsnip command we need to run in a screen session right on aus2-staging. It's running time-d, and I'll post the numbers here when it completes.
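Roughly what that looks like, for anyone repeating this (the backupsnip argument and log path below are placeholders, not the real ones):

    # on aus2-staging, as cltbld; detached screen so the run survives my ssh session
    screen -dmS backupsnip bash -c \
      '{ time ~/bin/backupsnip "$RELEASE_DIR"; } > ~/backupsnip-run.log 2>&1'
    # reattach later with: screen -r backupsnip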
(In reply to comment #4)
> I started up the actual backupsnip command we need to run in a screen session
> right on aus2-staging. It's running time-d, and I'll post the numbers here when
> it completes.

Took just over one hour to complete, which is in line with backupsnip attempts for previous releases.
summary from irc:

1) backupsnip typically takes 55-60 mins. coop made several attempts today which each failed after 2 hours. FWIW, this worked normally yesterday when lsblakk did the backupsnip for FF3.0.15.

2) coop tried 4 times, running from a few different slaves. Details in https://wiki.mozilla.org/Releases/Firefox_3.5.4/BuildNotes. It is unknown why this last attempt worked. 

3) mrz/aravind continuing to investigate.

4) Note: I marked this as a blocker because it was blocking work on the FF3.5.4 release. We've now gotten past that blockage and have announced the update snippets to QA. However, this needs to remain a blocker until we know this won't bite us again when we do backupsnip and push to beta later this week.
Running a few backupsnips in a row from the buildbot harness. Logs and timing info can be found here:

http://production-master02.build.mozilla.org:8010/builders/updates/builds/5
(In reply to comment #7)
> Running a few backupsnips in a row from the buildbot harness. Logs and timing
> info can be found here:

Urghh, make that:

http://production-master02.build.mozilla.org:8010/builders/updates/builds/6

Had a little issue with the date being off in the original command.
Problem *does* reproduce again today.

Start	Thu Oct 8 13:25:07 2009
End	Thu Oct 8 15:25:07 2009
...timed out at 2hr limit.
(In reply to comment #6)
> 4) Note: I marked this as blocker because it was blocking work on the FF3.5.4
> release. We've now got past that blockage, and have announced the updates
> snippets to QA. However, this needs to remain blocker until we know this wont
> bite us again when we do backupsnip and push to beta later this week.

rel drivers just confirmed we will be pushing FF3.5.4 to beta tomorrow, so we do need this fixed before 11am PDT tomorrow.
(In reply to comment #10)
> (In reply to comment #6)
> > 4) Note: I marked this as blocker because it was blocking work on the FF3.5.4
> > release. We've now got past that blockage, and have announced the updates
> > snippets to QA. However, this needs to remain blocker until we know this wont
> > bite us again when we do backupsnip and push to beta later this week.
> 
> rel drivers just confirmed we will be pushing FF3.5.4 to beta tomorrow, so we
> do need this fixed before 11am PDT tomorrow.

Obviously, we are trying out some workarounds in case this isn't resolved by 11am tomorrow. At this point, with slim chance of a "real" fix in time, we need one of these workarounds to come through for us in experiments tonight.
(In reply to comment #11) 
> Obviously, we are trying out some workarounds in case this isnt resolved by
> 11am tmrw. At this point, with slim chance of a "real" fix in time, we need one
> of these workarounds to come through for us in experiments tonight.

Whoa, it's not as bad as all that. The backupsnip/pushsnip commands we need to run tomorrow get run directly on aus2-staging anyway, so as long as those commands run to completion (even if they take longer), we're still fine. All our local runs have been fine time-wise so far, FWIW.

I want to figure out as much as anybody why spawning the command via buildbot is timing out, but buildbot isn't involved in the beta push tomorrow.
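For context, the buildbot-driven path is basically an ssh from the linux slave into aus2-staging, something along these lines (the flags and the argument are from memory/placeholders, so don't treat this as the exact command):

    # roughly what the 'updates' builder runs from the slave
    ssh -l cltbld aus2-staging "~/bin/backupsnip $RELEASE_TAG"
    # buildbot kills the step after 2 hours, presumably because it sees no output
    # while tar works silently on the far end, so a slow-but-healthy run looks like a hang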
Reproduced problem again running backupsnip:


Start	Thu Oct 8 16:59:37 2009
End	Thu Oct 8 17:59:23 2009
Status  SUCCESS

Start   Thu Oct 8 17:59:23 2009
End	Thu Oct 8 18:58:58 2009
Status  SUCCESS

Start	Thu Oct 8 18:58:58 2009
End	Thu Oct 8 20:59:01 2009
Status  FAIL

(still no idea why this is happening intermittently like this.)
(In reply to comment #12)
> (In reply to comment #11) 
> > Obviously, we are trying out some workarounds in case this isnt resolved by
> > 11am tmrw. At this point, with slim chance of a "real" fix in time, we need one
> > of these workarounds to come through for us in experiments tonight.
> 
> Whoa, it's not as bad as all that. The backupsnip/pushnip commands we need to
> run tomorrow get run directly on aus2-staging anyway, so as long as those
> commands run to completion (even if they take longer), we're still fine. 
Agreed.


> All our local runs have been fine time-wise so far, FWIW.
Oh, ok. Good to know that, coop, thanks. 

I only saw the failures from the URL in comment #8 (coop's run) and in comment #13 (joduinn's run). Not that it matters now that we have a workaround for tomorrow. While this situation is not perfect, at least it feels like we can now muddle through tomorrow.


> I want to figure out as much as anybody why spawning the command via buildbot
> is timing out, but buildbot isn't involved in the beta push tomorrow.

cool.
Is there anything here that needs IT's attention (specifically from a netapp perspective)?
(In reply to comment #15)
> Is there anything here that needs ITs attention (Specifically from netapp
> perspective)?

Welcome back, aravind!

Yep, this is still causing us problems. It's still unclear if this is a VM/ESX host issue, a netapp issue, or something else. Can you and mrz chat to see how best to proceed here?

If it helps, see the comments above for the times when we hit problems again yesterday. Also, we can easily reproduce the problem if that helps you with debugging.
I think there's something more subtle going on here. As I write, the backupsnip for 3.6b1 build1 has been running for 1hr 50 min and will time out soon. The size of the tarball produced is plausible, and the modification time implies backupsnip took 63 minutes to complete (clock on aus2-staging is correct too). I'm unpacking the backup and diffing against the source to look for differences, but I suspect it's a completely valid backup.

Looking at the running processes on aus2-staging with 'ps auxf', I can't find a process tree of sshd -> bash -> backupsnip. Yet moz2-linux-slave20 (which is the slave that ssh'd in to run the backup) thinks the ssh process is still running. It's in state Ss+: interruptible sleep, session leader, foreground process group. Attaching strace to the process isn't any help; we'd have to attach earlier, I suspect.

Seems to me there is some sort of hang-up problem here.
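For the record, roughly the checks involved (the grep pattern and tarball path are illustrative, not exact):

    # on aus2-staging: look for the expected process tree; in this case it was gone
    ps auxf | grep -A2 '[s]shd: cltbld'
    # on moz2-linux-slave20: the client-side ssh is still there, asleep
    ps -o pid,stat,etime,cmd -C ssh
    # sanity-check the backup itself by listing/unpacking it and diffing against the live tree
    tar -tzf "$BACKUP_TARBALL" | head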
Nick: if you end up doing it by hand, try creating the tarball in /tmp (as Aravind suggested last week) and then moving the created tarball into the correct dir. That will avoid a double-hit on the NFS partition.

Anticipating having to do this for 3.5.4 as well.
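Something along these lines, in other words (paths and variable names here are illustrative, not the real snippet directories):

    # build the tarball on local disk first, then do a single move onto the NFS mount
    tar -czf /tmp/aus2-backup.tar.gz -C "$SNIPPET_DIR" .
    mv /tmp/aus2-backup.tar.gz "$NFS_BACKUP_DIR"/
    # one sequential write to the netapp instead of reading from and writing to it at the same time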
Passing over to phong to check vmware settings.
Assignee: aravind → phong
I tried making the tar command verbose as a way of keeping data flowing over the ssh link from aus2-staging to the linux slave. A buildbot-driven backupsnip succeeded with this change, taking a bit longer at 75 minutes. I've only tried it once, so this is certainly not well tested for reliability, but it might be useful for 3.5.5.
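In essence the change is presumably just adding -v to the tar invocation inside backupsnip; a simplified sketch (the real script's variable names differ):

    # before: tar runs silently for ~an hour, so nothing crosses the ssh link
    #   tar -czf "$BACKUP_TARBALL" "$SNIPPET_DIR"
    # after: -v streams each filename to stdout, which should keep buildbot's
    # no-output timer from firing while the backup is still making progress
    tar -cvzf "$BACKUP_TARBALL" "$SNIPPET_DIR"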
Attachment #409665 - Flags: review?(bhearsum)
Comment on attachment 409665 [details] [diff] [review]
[checked in] Verbose tar

This local change on aus2-staging panicked me good when I was trying to backupsnip for 3.6b1. But if it works, let's do it!
Attachment #409665 - Flags: review?(bhearsum) → review+
(In reply to comment #20)
> Created an attachment (id=409665) [details]
> Verbose tar
> 
> I tried making the tar command verbose as a way of keeping data flowing over
> the ssh link from aus2-staging to linux slave. A buildbot-driven backupsnip
> succeeded with this change, taking a bit longer at 75 minutes. I've only tried
> it once so this is certainly not well tested for reliability, but might be
> useful for 3.5.5.

Thanks, Nick. Yep, more verbose is slightly slower, as you noted. However, it could be really useful for figuring out what's causing the timeout. (One theory I had was that the disks were fine and backupsnip completed successfully, but the end-of-process got missed by buildbot somehow. Your change should make this easy to spot next time it bites us.)
Comment on attachment 409665 [details] [diff] [review]
[checked in] Verbose tar

Checking in backupsnip;
/cvsroot/mozilla/tools/release/bin/backupsnip,v  <--  backupsnip
new revision: 1.2; previous revision: 1.1
done

aus2-staging:~cltbld/bin updated; staging-stage:~cltbld/bin set to pull from :ext:stgbld@cvs.mozilla.org:/cvsroot rather than the local CVS mirror, and updated. Sorry about that scare, Ben; I forgot to let you know.
Attachment #409665 - Attachment description: Verbose tar → [checked in] Verbose tar
[14:44]  <nthomas> we've worked around the problem
[14:54]  <phong> if that's the case, can we close that bug?
[14:59]  <nthomas> guh, that bug is so confusing
[15:00]  <nthomas> yeah, why don't you close it and we'll reopen if we hit it again
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard