Closed Bug 544727 Opened 14 years ago Closed 14 years ago

Rev 3 Windows Talos machines not always successfully doing cleanup

Categories

(Release Engineering :: General, defect, P2)

x86
Windows 7
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: anodelman)

References

Details

(Keywords: intermittent-failure)

Attachments

(4 files, 1 obsolete file)

Since it's new, my first suspicion would be that while nohup exists, it maybe doesn't exactly always _work_ on Windows.

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265505248.1265505990.31842.gz
Rev3 WINNT 7.0 mozilla-central talos on 2010/02/06 17:14:08  

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265505248.1265508537.27258.gz
Rev3 WINNT 7.0 mozilla-central talos dirty on 2010/02/06 17:14:08  

C:\Windows\system32\cmd.exe /c nohup rm -vrf *
...
removed directory: `talos/tpan'
removed `talos/ttest.py'
removed `talos/ttest.pyc'
removed `talos/utils.py'
removed `talos/utils.pyc'
removed `talos/winmo.config'
program finished with exit code 0
elapsedTime=242.818000
=== Output ended ===
======== BuildStep ended ========
======== BuildStep started ========
talos dir creation failed
=== Output ===
C:\Windows\system32\cmd.exe /c mkdir talos
...
A subdirectory or file talos already exists.
program finished with exit code 1
This is not at all a useful guide to frequency: tbpl doesn't show burning Talos for some reason, so I only see these failures when I happen to notice firebot mentioning a tree changing from something else to burning. Still:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265678130.1265681437.3902.gz
Rev3 WINNT 7.0 mozilla-central talos dirty on 2010/02/08 17:15:30  
s: talos-r3-w7-014

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1265678320.1265679042.8042.gz
Rev3 WINNT 7.0 mozilla-central talos nochrome on 2010/02/08 17:18:40  
s: talos-r3-w7-006

(Odd, though possibly coincidence, that both times I've seen it it's been a pair of failures off the same run.)
We could try using |attrib -s -h -r /s builddir| and |rmdir /s /q builddir| instead of the msys coreutils rm. rmdir will not delete system or hidden files, so the attrib command first clears the system, hidden, and read-only flags recursively; rmdir can then remove the directory recursively and quietly.

(http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/rmdir.mspx?mfr=true)
Assignee: nobody → anodelman
Priority: -- → P2
On staging I found permission-denied warnings during file removal on w7. This patch does a chmod a+rx before attempting to delete files. I believe that this should fix this random orange.
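For illustration only, here is a minimal Python sketch of the "chmod, then delete" idea. It is not the checked-in patch (which ran coreutils chmod and rm from the build step and is not shown in this bug); the function names make_accessible and clean_dir are hypothetical, and the sketch grants a+rwx rather than a+rx so that directories are also writable.

```python
import os
import shutil
import stat


def make_accessible(root):
    """Grant read/write/execute to everyone under root (roughly chmod -R a+rwx)."""
    mode = stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO
    # Fix the top directory first so that it can be listed at all.
    os.chmod(root, mode)
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk is top-down, so each subdirectory is fixed before it is entered.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                os.chmod(path, mode)


def clean_dir(root):
    """Relax permissions, then remove the whole tree."""
    make_accessible(root)
    shutil.rmtree(root)
```

On Windows, os.chmod only toggles the read-only flag, so this covers read-only files but not other causes of a failed delete.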
Attachment #426591 - Flags: review?(catlee)
Attachment #426591 - Flags: review?(catlee) → review+
Attachment #426591 - Attachment description: chmod files before attempting to delete them → [checked in]chmod files before attempting to delete them
Attachment #426591 - Flags: checked-in+
This is an untested patch. Instead of using coreutils rm and chmod, it uses the native Windows attrib and rmdir. The cleandir property is needed because rmdir and attrib do not accept a wildcard target such as 'rmdir /s /q *', so we need to either hardcode a list of directories, use the same directory name for all builders, or delete the builder dir using attrib/rmdir from .. and then recreate it (this patch).

Recent versions of buildbot have a feature that allows you to share the slave-side builder dir.

I am not sure if this will work as I don't know all of the peculiarities of the windows buildbot slave.

Another option might be to use the cygwin coreutils package. It might have a more up-to-date version of coreutils and should run with only rm.exe and cygwin1.dll (we don't need the whole cygwin stack).
Attachment #427174 - Flags: review?(anodelman)
Alternatively, it seems that the directory we care about is called 'talos'. If that is the case, we could just run rmdir /s /q talos instead of setting up all the properties.
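As a rough sketch of the native-tools variant, the sequence "clear flags, remove, recreate" could be built like this in Python. The helper name native_cleanup_commands is hypothetical. The attrib and rmdir flags are copied verbatim from the comments above and are as untested here as they are there. Each command is wrapped in cmd.exe /c, as the build steps in the logs above are.

```python
def native_cleanup_commands(dirname="talos"):
    """Return the cmd.exe invocations for a native-tool cleanup of dirname.

    Nothing is executed here; the caller decides how to run each command.
    """
    shell = ["cmd.exe", "/c"]
    return [
        # Clear the system, hidden and read-only flags (flags as quoted in this bug).
        shell + ["attrib", "-s", "-h", "-r", "/s", dirname],
        # Remove the directory tree quietly.
        shell + ["rmdir", "/s", "/q", dirname],
        # Recreate an empty directory for the next run.
        shell + ["mkdir", dirname],
    ]


if __name__ == "__main__":
    for command in native_cleanup_commands():
        print(" ".join(command))
```

On a Windows slave each list could be passed to subprocess.check_call. Building the commands separately from running them keeps the sequence easy to inspect in staging first.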
This issue also appears on our fed/fed64 rev3 slaves.
This will fix the redness on fed/fed64 by using nohup during cleanup (this has worked in the past and was removed during a failed attempt at using usepty=0 in the buildbot slave config).

Adding chmod a+rwx before doing the cleanup on windows to see if that will fix things there.

While this is baking I'll look into jhford's solution so that we have something to try next.
Attachment #427215 - Flags: review?(joduinn)
Attachment #427215 - Flags: review?(joduinn) → review+
Comment on attachment 427215
[checked in]quick fix to try overnight

looks good.
Comment on attachment 427215
[checked in]quick fix to try overnight

changeset:   613:0fa180cd0e4a
Attachment #427215 - Attachment description: quick fix to try overnight → [checked in]quick fix to try overnight
Attachment #427215 - Flags: checked-in+
Linux green overnight.

Still seeing intermittent problems on win7.
That is unfortunate. I wonder if running that on the command line of the slave would change anything. It looks like a lot of those files are web page files. Maybe Apache is trying to read the files while we are doing cleanup?
Another possible fix: move talos to talos-%random%, so that even if cleanup fails we can still successfully create a new, clean talos dir and carry on with testing. We'll get another chance to clean up the old talos dir on reboot.
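A minimal Python sketch of the move-aside idea, assuming a hypothetical helper move_aside_and_recreate and a random suffix taken from uuid rather than the shell's %random%. The attachment itself is not shown in this bug, so the exact order of steps is an assumption.

```python
import os
import shutil
import uuid


def move_aside_and_recreate(dirname="talos"):
    """Rename dirname to dirname-<random>, try to delete it, then make a fresh dirname.

    Returns the name the old directory was moved to, or None if there was none.
    """
    stale = None
    if os.path.exists(dirname):
        stale = "%s-%s" % (dirname, uuid.uuid4().hex[:8])
        os.rename(dirname, stale)
        # Best effort only: anything left behind waits for a later cleanup pass.
        shutil.rmtree(stale, ignore_errors=True)
    # The original name is free now, whether or not the delete succeeded.
    os.mkdir(dirname)
    return stale
```

The point of the rename is that creating the new directory no longer depends on the delete succeeding.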
Attachment #427452 - Flags: review?(joduinn)
Attachment #427452 - Flags: review?(joduinn) → review+
Comment on attachment 427452
move talos dir out of the way before attempting cleanup

already tested in staging. looks good.
I think that this is the real fix. tp4 contains some really, really long paths + filenames. I believe that we are exceeding the path length limit, and thus crash when attempting to delete the files. Moving the whole tp4 directory out of talos/page_load_test means that we can successfully remove everything.

Works on stage.  Also matches with my observations of attempting to remove the long path named files by hand.
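To make the idea concrete, here is a hedged Python sketch of "move tp4 to a shorter path, then clean up". The function name clean_talos and the destination name tp4-old are hypothetical; the location talos/page_load_test/tp4 is inferred from the comment above, and the checked-in patch itself is not reproduced in this bug.

```python
import os
import shutil


def clean_talos(talos_dir="talos", short_name="tp4-old"):
    """Move tp4 next to talos_dir under a short name, then delete both trees."""
    tp4 = os.path.join(talos_dir, "page_load_test", "tp4")
    parent = os.path.dirname(os.path.abspath(talos_dir))
    short = os.path.join(parent, short_name)

    # Remove leftovers from an earlier run so the rename cannot collide.
    if os.path.isdir(short):
        shutil.rmtree(short)

    if os.path.isdir(tp4):
        # A rename of the directory itself shortens every path beneath it
        # without touching the long-named files one by one.
        os.rename(tp4, short)
        shutil.rmtree(short)

    if os.path.isdir(talos_dir):
        shutil.rmtree(talos_dir)
```

Whether Python's own file APIs would hit the same limit on these slaves is not something this bug establishes; the sketch only shows the ordering of the steps.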
Attachment #427452 - Attachment is obsolete: true
Attachment #427497 - Flags: review?(joduinn)
Attachment #427497 - Flags: review?(joduinn) → review+
Comment on attachment 427497
[checked in]move tp4 dir to shorter path before attempting cleanup

Looks good, works in staging, so r+. Also, I note this is similar to a problem we hit a couple of years ago on win32 desktop builds.
Comment on attachment 427497
[checked in]move tp4 dir to shorter path before attempting cleanup

changeset:   617:996d58ea54a5
Attachment #427497 - Attachment description: move tp4 dir to shorter path before attempting cleanup → [checked in]move tp4 dir to shorter path before attempting cleanup
Attachment #427497 - Flags: checked-in+
All green overnight. Will reopen if this recurs.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Whiteboard: [orange]
Product: mozilla.org → Release Engineering