Closed Bug 1501250 Opened 6 years ago Closed 6 years ago

Intermittent [worker:error] distutils.errors.DistutilsFileError: cannot copy tree '/builds/worker/artifacts': not a directory

Categories

(Testing :: General, defect, P5)

Version 3
defect

Tracking

(firefox-esr60 fixed, firefox64 fixed, firefox65 fixed)

RESOLVED FIXED
mozilla65
Tracking Status
firefox-esr60 --- fixed
firefox64 --- fixed
firefox65 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: dragrom)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disable-recommended])

Attachments

(1 file)

possibly due to Bug 1474570
See Also: → 1474570
(In reply to Bob Clary [:bc:] from comment #1)
> possibly due to Bug 1474570

Yeah, this certainly is the cause. Sorry that I didn't catch this in review.

This broken task runs on taskcluster-worker ("provisionerId": "proj-autophone", "workerType": "gecko-t-ap-unit-p2"), so it looks like the taskcluster-worker implementation has been broken during the migration from taskcluster-worker to generic-worker for linux talos tasks.

I suspect this will be a relatively simple fix that we can roll out quickly.

The points of interest are:

In the logs, I see:

+ : WORKING_DIR /builds/worker/workspace
+ : WORKSPACE /builds/worker/workspace

From task definition https://queue.taskcluster.net/v1/task/Ag2De52ITG6fa3SEaS2PBQ I see "WORKSPACE" is set to "/builds/worker/workspace" and WORKING_DIR isn't set, so it will default to the current directory. It looks like taskcluster-worker runs processes from the /builds/worker/workspace directory, but I'll have to check the taskcluster-worker implementation to see whether it uses the "WORKSPACE" env var or chooses this path some other way (such as hardcoded to ~/workspace).

I suspect the solution will be to pass in both WORKING_DIR _instead_ of WORKSPACE, with `WORKING_DIR=/builds/worker`. That should work with the updated test-linux.sh script.
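
A minimal sketch of the defaulting described in this comment, modelled in Python rather than taken from test-linux.sh (the real script is shell and is not shown in this bug). The function names `resolve_dirs` and `artifacts_dir` are hypothetical, and the idea that artifacts are written to `$WORKING_DIR/artifacts` is an assumption made only to show why the worker might not find '/builds/worker/artifacts'.

```python
import posixpath


def resolve_dirs(env, cwd):
    """Model the defaulting described above (not the real test-linux.sh).

    WORKING_DIR falls back to the current directory when it is unset;
    a value that is already set is left untouched.
    """
    resolved = dict(env)
    resolved.setdefault("WORKING_DIR", cwd)
    return resolved


def artifacts_dir(resolved):
    # ASSUMPTION: artifacts are written under $WORKING_DIR/artifacts.
    return posixpath.join(resolved["WORKING_DIR"], "artifacts")


# Failing setup: only WORKSPACE is set and the process starts in the workspace.
broken = resolve_dirs(
    {"WORKSPACE": "/builds/worker/workspace"},
    cwd="/builds/worker/workspace",
)

# Proposed setup: WORKING_DIR is passed explicitly instead of WORKSPACE.
fixed = resolve_dirs(
    {"WORKING_DIR": "/builds/worker"},
    cwd="/builds/worker/workspace",
)

print(artifacts_dir(broken))  # /builds/worker/workspace/artifacts
print(artifacts_dir(fixed))   # /builds/worker/artifacts
```

Under these assumptions, the first case puts artifacts one level below where the worker looks for them, which would match the "not a directory" error in the bug title.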
Note, longer term, the preferred fix is to migrate to generic-worker from taskcluster-worker (bug 1488392) - I believe project-autophone tasks are the last remaining tasks that run on taskcluster-worker.
That is high on my list of todos and getting higher every minute. ;-)
(In reply to Bob Clary [:bc:] from comment #4)
> That is high on my list of todos and getting higher every minute. ;-)

Haha, no worries! :-)

Typo in comment 2:

> I suspect the solution will be to pass in both WORKING_DIR _instead_ of
> WORKSPACE, with `WORKING_DIR=/builds/worker`. That should work with the
> updated test-linux.sh script.

should have been:

> I suspect the solution will be to pass in WORKING_DIR _instead_ of
> WORKSPACE, with `WORKING_DIR=/builds/worker`. That should work with the
> updated test-linux.sh script.

I've created https://tools.taskcluster.net/groups/CsEKkSVZSYKzwFoZ5POEIA/tasks/CsEKkSVZSYKzwFoZ5POEIA/details to test this hypothesis. It is a copy of https://queue.taskcluster.net/v1/task/Ag2De52ITG6fa3SEaS2PBQ but with the env vars changed; I removed WORKSPACE and set WORKING_DIR to /builds/worker.

Let's see how it goes!
We might need to update the bitbar docker container to handle WORKING_DIR. If WORKSPACE is not specified, it will set it to /builds/worker/workspace and pass WORKSPACE to the taskcluster-worker's environment but it won't know about WORKING_DIR and won't pass it at all. I have to run out to an appointment this morning and will be gone for 2-3 hours. I'll check back when I return.
(In reply to Pete Moore [:pmoore][:pete] from comment #6)

> I've created
> https://tools.taskcluster.net/groups/CsEKkSVZSYKzwFoZ5POEIA/tasks/
> CsEKkSVZSYKzwFoZ5POEIA/details to test this hypothesis.

This task is still pending after 20 minutes - does your tool to spawn new workers fetch the pending count from here?

  https://queue.taskcluster.net/v1/pending/proj-autophone/gecko-t-ap-unit-p2

I had a vague memory that maybe it queries treeherder for pending tasks, but this task won't appear on treeherder, so it might be better to fetch the pending count directly from taskcluster.

Many thanks!
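
A rough sketch of reading the pending count directly from the queue URL quoted above, using only the Python standard library. The helper names are hypothetical, and the assumption that the response is a JSON object with an integer `pendingTasks` field is not confirmed anywhere in this bug; the host is the one quoted in the comment and may no longer be served.

```python
import json
from urllib.request import urlopen

# Root of the queue API as quoted in the comment above.
QUEUE_ROOT = "https://queue.taskcluster.net/v1"


def pending_url(provisioner_id, worker_type):
    """Build the pending-count URL for one provisioner/worker type pair."""
    return f"{QUEUE_ROOT}/pending/{provisioner_id}/{worker_type}"


def parse_pending(body):
    # ASSUMPTION: the response body is JSON with an integer "pendingTasks" field.
    return int(json.loads(body)["pendingTasks"])


def fetch_pending(provisioner_id, worker_type, timeout=30):
    """Fetch and parse the pending count (performs a network request)."""
    url = pending_url(provisioner_id, worker_type)
    with urlopen(url, timeout=timeout) as response:
        return parse_pending(response.read())
```

A worker-spawning tool could call `fetch_pending("proj-autophone", "gecko-t-ap-unit-p2")` instead of asking treeherder, so tasks that never appear on treeherder would still be counted.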
(In reply to Bob Clary [:bc:] from comment #7)
> We might need to update the bitbar docker container to handle WORKING_DIR.
> If WORKSPACE is not specified, it will set it to /builds/worker/workspace
> and pass WORKSPACE to the taskcluster-worker's environment but it won't know
> about WORKING_DIR and won't pass it at all. I have to run out to an
> appointment this morning and will be gone for 2-3 hours. I'll check back
> when I return.

Ah ok - many thanks. In that case we could set both explicitly in the task definition:

"WORKING_DIR": "/builds/worker",
"WORKSPACE": "/builds/worker/workspace",

the test-linux.sh script won't overwrite them if they are already set.
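
For illustration only, here is a small Python sketch of how the two variants discussed in this bug (the experiment that drops WORKSPACE and sets WORKING_DIR, and the suggestion to set both explicitly) could be derived from an existing task definition. The helper `with_env` is hypothetical, and the location of the environment block at `payload.env` is an assumption; the real task definition layout is not shown here.

```python
import copy


def with_env(task, set_vars=None, drop_vars=()):
    """Return a copy of a task definition with its env vars adjusted.

    ASSUMPTION: environment variables live under task["payload"]["env"].
    The original task dict is never modified.
    """
    new_task = copy.deepcopy(task)
    env = new_task.setdefault("payload", {}).setdefault("env", {})
    for name in drop_vars:
        env.pop(name, None)
    env.update(set_vars or {})
    return new_task


# Stand-in for the failing task: only WORKSPACE is set.
original = {"payload": {"env": {"WORKSPACE": "/builds/worker/workspace"}}}

# Variant 1: remove WORKSPACE and set WORKING_DIR.
experiment = with_env(
    original,
    set_vars={"WORKING_DIR": "/builds/worker"},
    drop_vars=["WORKSPACE"],
)

# Variant 2: set both variables explicitly.
explicit = with_env(
    original,
    set_vars={
        "WORKING_DIR": "/builds/worker",
        "WORKSPACE": "/builds/worker/workspace",
    },
)

print(experiment["payload"]["env"])
print(explicit["payload"]["env"])
```

Setting both values explicitly sidesteps the question of which component supplies a default, since neither the container nor the script then has to guess.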
Attachment #9019384 - Flags: review?(pmoore)
Assignee: nobody → dcrisan
Status: NEW → ASSIGNED
(In reply to Pete Moore [:pmoore][:pete] from comment #8)
> (In reply to Pete Moore [:pmoore][:pete] from comment #6)
> 
> > I've created
> > https://tools.taskcluster.net/groups/CsEKkSVZSYKzwFoZ5POEIA/tasks/
> > CsEKkSVZSYKzwFoZ5POEIA/details to test this hypothesis.
> 

That finally ran. Unfortunately most hit bug 1499246 but at least one hit this error.

> This task is still pending after 20 minutes - does your tool to spawn new
> workers fetch the pending count from here?
> 
>   https://queue.taskcluster.net/v1/pending/proj-autophone/gecko-t-ap-unit-p2
> 

No.

> I had a vague memory that maybe it queries treeherder for pending tasks, but
> this task won't appear on treeherder, so it might be better to fetch the
> pending count directly from taskcluster.
> 
> Many thanks!

It does use treeherder at the moment. I'll look into changing it to use the pending queue. Filed Bug 1501350. Thanks.

(In reply to Dragos Crisan [:dragrom] from comment #11)
> Test patch on try:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=aa3c248be83067e34d957928a85182a9a399a992

Unfortunately that didn't exercise the android-hw. This will work:

./mach try fuzzy --full --query "android-hw mda"

But if you like, I can submit your patch and check it out. Let me know.
(In reply to Bob Clary [:bc:] from comment #12)
> (In reply to Pete Moore [:pmoore][:pete] from comment #8)
> > (In reply to Pete Moore [:pmoore][:pete] from comment #6)
> > 
> > > I've created
> > > https://tools.taskcluster.net/groups/CsEKkSVZSYKzwFoZ5POEIA/tasks/
> > > CsEKkSVZSYKzwFoZ5POEIA/details to test this hypothesis.
> > 
> 
> That finally ran. Unfortunately most hit bug 1499246 but at least one hit
> this error.
> 
> > This task is still pending after 20 minutes - does your tool to spawn new
> > workers fetch the pending count from here?
> > 
> >   https://queue.taskcluster.net/v1/pending/proj-autophone/gecko-t-ap-unit-p2
> > 
> 
> No.
> 
> > I had a vague memory that maybe it queries treeherder for pending tasks, but
> > this task won't appear on treeherder, so it might be better to fetch the
> > pending count directly from taskcluster.
> > 
> > Many thanks!
> 
> It does use treeherder at the moment. I'll look into changing it to use the
> pending queue. Filed Bug 1501350. Thanks.
> 
> (In reply to Dragos Crisan [:dragrom] from comment #11)
> > Test patch on try:
> > https://treeherder.mozilla.org/#/
> > jobs?repo=try&revision=aa3c248be83067e34d957928a85182a9a399a992
> 
> Unfortunately that didn't exercise the android-hw. This will work:
> 
> ./mach try fuzzy --full --query "android-hw mda"
> 
> But if you like, I can submit your patch and check it out. Let me know.

Please submit my patch and let me know if it works. I also added the M tests from android 8 in https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa3c248be83067e34d957928a85182a9a399a992.
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&group_state=expanded&revision=043f66d558d7173087740eeddf8248c34259c0d6

I don't think this will help as the bitbar containers are unaware of WORKING_DIR, but we'll see.
dragrom: This did seem to help. The failures in my try push are not related to this error.
Comment on attachment 9019384 [details] [diff] [review]
fix_bitbar_tests.patch

Review of attachment 9019384 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, many thanks!
Attachment #9019384 - Flags: review?(pmoore) → review+
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/460f9791ba8a
Intermittent [worker:error] distutils.errors.DistutilsFileError: cannot copy tree '/builds/worker/artifacts': not a directory, r=pmoore
Attachment #9019384 - Flags: checked-in+
We'll want this on beta as well now that bug 1474570 has merged there.
https://hg.mozilla.org/mozilla-central/rev/460f9791ba8a
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Comment on attachment 9019384 [details] [diff] [review]
fix_bitbar_tests.patch

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1474570

User impact if declined: No android hardware testing on mozilla-beta

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): Not risky as it is a simple change to add an environment variable to the test environment.

String changes made/needed:
Attachment #9019384 - Flags: approval-mozilla-beta?
Comment on attachment 9019384 [details] [diff] [review]
fix_bitbar_tests.patch

test-only changes don't need approval to land
Attachment #9019384 - Flags: approval-mozilla-beta?