Closed Bug 1210055 Opened 4 years ago Closed 4 years ago

Intermittent Windows Talos failures like Unable to remove C:\slave\test-pgo\build! or Caught exception: (206, 'DeleteFile', 'The filename or extension is too long.')

Categories

(Release Engineering :: General, defect)


Tracking

(firefox41 unaffected, firefox42 affected, firefox43 affected, firefox44 fixed, firefox-esr38 unaffected)

RESOLVED FIXED

People

(Reporter: KWierso, Assigned: jmaher)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

Possibly PGO-only: https://treeherder.mozilla.org/logviewer.html#?job_id=14964165&repo=mozilla-inbound
While this is in a talos job, we haven't even made it to talos code; this is more of a machine issue.  One suspect is that we are now putting all page files inside the talos directory instead of apache, and we were already close to the limits of the file system (I had to s/page_load_test/tests/).

Why do we have test-pgo vs. test in the directory name?  How can I determine this?  It seems like a real problem.  Also, why is this XP only?  If it were a real issue with filenames, I would expect Windows 7 and possibly Windows 8 to be experiencing the same issues.

Lastly, this is seen on beta/aurora where we don't have talos with a local webserver.  I think that is irrelevant because if the talos local webserver is to blame then it is messing things up per machine, not per branch.

Either way, there are a lot of missing data points.
Component: Talos → General Automation
Product: Testing → Release Engineering
QA Contact: catlee
:callek, any thoughts on the above questions?  Maybe getting a loaner without cleaning it up would help?
Flags: needinfo?(bugspam.Callek)
A loaner might help, but let me punt over to :arr first in case she and her Windows experts have any insights.
Flags: needinfo?(bugspam.Callek) → needinfo?(arich)
Not sure. Maybe only XP has path names that exceed the maximum because of specifically named files for that platform?
Flags: needinfo?(arich)
Depends on: 1210495
Looking at affected machines (easy to do, that's nearly the entire pool now), e.g.

https://secure.pub.build.mozilla.org/buildapi/recent/t-xp32-ix-058?numbuilds=200
https://secure.pub.build.mozilla.org/buildapi/recent/t-xp32-ix-136?numbuilds=200
https://secure.pub.build.mozilla.org/buildapi/recent/t-xp32-ix-065?numbuilds=200
https://secure.pub.build.mozilla.org/buildapi/recent/t-xp32-ix-101?numbuilds=200

with a search for pgo talos, the common pattern seems to be "once you run a trunk g1 or g2 pgo run after ___, you will be broken for all pgo talos jobs, including aurora and beta, from then on." So far, I haven't found one where the pgo talos run immediately before the first instance of this was anything other than g1/g2, and the few still-unaffected ones have run g1/g2 pgo, but only on aurora and beta.

So I'd say g1/g2 plus talos with local webserver is within 4 characters of one of WinXP's various filename/path length limits, and -pgo in test-pgo puts it over.
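To make the arithmetic in that guess concrete, here is a minimal Python sketch. It assumes the limit in play is the classic Win32 MAX_PATH of 260 characters (which counts the terminating NUL); the thread does not confirm which of the limits is actually being hit, and the padded file name below is made-up filler, not a real talos path.

```python
MAX_PATH = 260  # classic Win32 path limit, counting the terminating NUL


def headroom(path, limit=MAX_PATH):
    """Characters left before `path` exceeds the limit (negative = too long)."""
    return (limit - 1) - len(path)


def exceeds_max_path(path, limit=MAX_PATH):
    """True if `path` is too long for APIs bound by the classic limit."""
    return headroom(path, limit) < 0


# Hypothetical file name, padded so it fits exactly under the "test" directory.
test_root = "C:\\slave\\test\\build"
pgo_root = "C:\\slave\\test-pgo\\build"
filler = "\\" + "x" * (MAX_PATH - 1 - len(test_root) - 1)

print(headroom(test_root + filler))  # 0
print(headroom(pgo_root + filler))   # -4
```

The same relative path that just fits under test ends up 4 characters over under test-pgo, which is the situation described above.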
I have been trying to reproduce this locally, I could easily remove all the files on the loaner machine I had, so I don't know why we have troubles.  Either way, I have a loaner and can hack on this a bit more.
Because I have that kind of time, I tried to retrigger my way to getting talos to run on the tip of aurora and beta. I got bored after 20 retriggers, with 2 of the 6 suites left still red on beta, and 4 on aurora. We just don't run talos on PGO builds on WinXP anymore, so I've hidden them.
(In reply to Joel Maher (:jmaher) from comment #27)
> I have been trying to reproduce this locally, I could easily remove all the
> files on the loaner machine I had, so I don't know why we have troubles. 

One part of "one of WinXP's various filename/path length limits" is that it matters what Windows API either you or the program you are using calls, so "I could remove them from Explorer when winrm or rm couldn't" isn't unusual, it's normal.
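As an illustration of why the API matters: the Unicode variants of the Win32 file functions accept paths carrying the \\?\ prefix, and those paths are not bound by MAX_PATH, while callers passing plain paths are. The helper below is a hypothetical sketch (to_extended_path is not a mozharness function, and nothing in this bug says which form the failing code used); it only shows how an absolute path is rewritten into the extended-length form.

```python
import ntpath  # Windows path rules, importable on any platform

EXTENDED_PREFIX = "\\\\?\\"        # the literal characters \\?\
EXTENDED_UNC_PREFIX = "\\\\?\\UNC\\"  # the literal characters \\?\UNC\


def to_extended_path(path):
    """Rewrite an absolute Windows path into its extended-length form."""
    if path.startswith(EXTENDED_PREFIX):
        return path  # already extended
    if not ntpath.isabs(path):
        raise ValueError("extended-length paths must be absolute: %r" % path)
    # The prefix turns off normalization in Windows, so normalize here.
    path = ntpath.normpath(path)
    if path.startswith("\\\\"):
        # \\server\share\dir becomes \\?\UNC\server\share\dir
        return EXTENDED_UNC_PREFIX + path[2:]
    return EXTENDED_PREFIX + path


print(to_extended_path("C:\\slave\\test-pgo\\build"))
```

A tool that deletes through the extended form can succeed on a tree where a tool using the plain path fails with error 206, and vice versa, which fits the "works in one tool, fails in another" behaviour described above.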
Blocks: 1211209
Tested this on a loaner. I really wonder why we do the fancy win32 file stuff; we could just do something like "del /f /s /q %s" or "rmdir /s /q %s", but this win32api stuff is in there for some reason.

I assume this in-tree fix will work for where we are failing, I am not sure of the status of in-tree/out-tree mozharness bits.
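For readers who want the shape of the idea, here is a minimal sketch of "try the normal removal, then fall back to the shell". It is not the attached patch: remove_tree is a hypothetical name, shutil.rmtree stands in for the existing win32-based removal, and whether cmd's rmdir copes with every over-long path is an assumption that would need checking on an affected machine.

```python
import os
import shutil
import subprocess
import sys


def remove_tree(path):
    """Remove `path` and return the name of the strategy that worked."""
    if not os.path.exists(path):
        return "missing"
    try:
        shutil.rmtree(path)  # stand-in for the regular removal code
        return "rmtree"
    except OSError:
        if sys.platform != "win32":
            raise  # the shell fallback below is Windows-only
    # Last resort: let cmd.exe delete the tree recursively and quietly.
    subprocess.check_call(["cmd", "/c", "rmdir", "/s", "/q", path])
    if os.path.exists(path):
        # Do not trust the exit code alone; confirm the tree is gone.
        raise OSError("unable to remove %s" % path)
    return "shell"
```

Checking that the directory is really gone after the shell command keeps a silent failure from being reported as success.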
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8669593 - Flags: review?(j.parkouss)
Attachment #8669593 - Flags: feedback?(jlund)
Comment on attachment 8669593 [details] [diff] [review]
fallback to shell command when all else fails (1.0)

LGTM.
Attachment #8669593 - Flags: review?(j.parkouss) → review+
https://hg.mozilla.org/mozilla-central/rev/78a1f7a71490
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
> I assume this in-tree fix will work for where we are failing, I am not sure
> of the status of in-tree/out-tree mozharness bits.

Out-of-tree mh is only used for a few releng services that are forked and self-contained.
Thanks jlund!  This patch seems to be working in-tree, so this is good news!
Attachment #8669593 - Flags: feedback?(jlund)
Component: General Automation → General