generic-worker: open tasks-resolved-count.txt: permission denied
Categories
(Infrastructure & Operations :: RelOps: Posix OS, defect)
Tracking
(Not tracked)
People
(Reporter: jcristau, Assigned: aerickson)
References
Details
(Whiteboard: [relops-linux])
Attachments
(1 file)
We're getting error emails from generic-worker on the translations-1/t-linux-v100-gpu pool, which say: open tasks-resolved-count.txt: permission denied
I'm not sure where that comes from, whether it's a problem with the image, or g-w itself, or the use of the simple engine?
Comment 1•1 year ago
My guess is that generic-worker was run as one user when it created the file tasks-resolved-count.txt and is now running as another user, which doesn't have read access to the file. Moving to RelOps as they manage this worker pool.
@aerickson, can you take a look at the permissions/ownership of that file, and see which user the generic-worker is running as? Thanks!
Comment 2•1 year ago (Assignee)
I'm getting similar errors for my test pool.
Worker Manager has encountered an error while trying to provision the worker pool translations-1/b-linux-aerickson-test:
open tasks-resolved-count.txt: permission denied
ErrorId: 5pdJVjvmRViwvr0j5R6eSA
It includes the extra information:
GOARCH: amd64
GOOS: linux
cleanUpTaskDirs: 'true'
deploymentId: ''
engine: simple
gwRevision: fa6cdbdf9a2aa6616cb6997e5f91e14a33685c87
gwVersion: 48.1.0
instanceType: projects/887720501152/machineTypes/n2-standard-2
provisionerId: translations-1
rootURL: https://firefox-ci-tc.services.mozilla.com
workerGroup: us-central1
workerId: '7765874583162361319'
workerType: b-linux-aerickson-test
@pmoore, these are built in monopacker. This is happening during provisioning. They have never had g-w run on them, so how could this file exist? Does the same error get emitted if the file is not present? What path does g-w try to create this file at? (It would be helpful to have that in the error too.)
Comment 3•1 year ago
Ah, interesting. It tries to create the file in the working directory of the generic-worker process.
It looks like Worker Runner doesn't explicitly set a working directory for generic-worker, so it should inherit the working directory of the start-worker process.
In the systemd Service configuration I don't see an explicit WorkingDirectory setting. That probably should be set to a writable path for the user that runs generic-worker.
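To illustrate the mechanism described here, the following is a minimal Python sketch (not generic-worker or Worker Runner code) of how a child process inherits its parent's working directory and how a relative filename is resolved against it. The `run_child` helper and the inline `CHILD` script are hypothetical stand-ins; passing `cwd=` plays the role that a `WorkingDirectory` setting would play in a systemd unit.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the worker: it writes a file using a relative
# name, so the file lands in whatever directory the process was started in.
CHILD = """
import os
with open("tasks-resolved-count.txt", "w") as f:
    f.write("0")
print(os.getcwd())
"""


def run_child(cwd=None):
    """Run the stand-in worker and return the directory it reports as its cwd.

    With cwd=None the child inherits the parent's working directory, which
    mirrors a launcher that does not set one explicitly.
    """
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as writable_dir:
        # Explicit cwd: the relative file is created inside writable_dir.
        print(run_child(cwd=writable_dir))
        print(os.listdir(writable_dir))
```

If the inherited directory is not writable by the user running the child, the relative `open` fails with a permission error even though the file has never existed before.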
Comment 4•1 year ago (Assignee)
Thanks :pmoore. I've created a fixed image and will test it out.
Comment 5•1 year ago (Assignee)
See https://mozilla-hub.atlassian.net/browse/RELOPS-747 and https://github.com/mozilla-platform-ops/monopacker/pull/115.
This latest change updates the TC components.
Comment 6•1 year ago (Assignee)
:jcristau, the image mentioned in the phab above should fix the systemd issue (it also updates the TC components). I'll run some sanity tests.
Comment 8•1 year ago (Assignee)
I've started a try push to test this out.
https://treeherder.mozilla.org/jobs?repo=try&revision=385b80206da496b84e82c4d7b5269113e4c02e92
Comment 9•1 year ago (Assignee)
(In reply to Andrew Erickson [:aerickson] from comment #8)
> I've started a try push to test this out.
> https://treeherder.mozilla.org/jobs?repo=try&revision=385b80206da496b84e82c4d7b5269113e4c02e92
Wrong link. I don't have a try command to test out this pool (TODO: work with releng to find one). I've run a basic TC 'create task' that sleeps 100 and it looks good (https://firefox-ci-tc.services.mozilla.com/tasks/bJCbiV2gQjm_TkvXzRU5Hw/runs/0).
:jcristau, will you please try it out in your test queue (and if it looks good use it in prod)?
Comment 10•1 year ago (Assignee)
:pmoore, it doesn't seem to have fixed the issue (PR is https://github.com/mozilla-platform-ops/monopacker/pull/115). I noticed I received this on my test pool (I know it's the new image because I upgraded to 57.0.1):
Worker Manager has encountered an error while trying to provision the worker pool translations-1/b-linux-aerickson-test:
open tasks-resolved-count.txt: permission denied
ErrorId: ee28-1x9TrG91q_e22P5cw
It includes the extra information:
GOARCH: amd64
GOOS: linux
cleanUpTaskDirs: 'true'
deploymentId: ''
engine: simple
gwRevision: fe45895655139035aac1083981d4ecf7b78cfb2e
gwVersion: 57.0.1
instanceType: projects/887720501152/machineTypes/n2-standard-2
provisionerId: translations-1
rootURL: https://firefox-ci-tc.services.mozilla.com
workerGroup: us-central1
workerId: '5767626759346923236'
workerType: b-linux-aerickson-test
Can you make that error give the full path? Not sure what's going on.
Comment 11•1 year ago
(In reply to Andrew Erickson [:aerickson] from comment #10)
> Can you make that error give the full path? Not sure what's going on.
Sure. That is a great idea. I've created https://github.com/taskcluster/taskcluster/pull/6680.
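As an illustration only, and not the contents of the pull request above, here is a Python sketch of the general idea: resolve the relative name to an absolute path before opening it, so that any error names the directory actually used. The helper name `open_with_full_path` is hypothetical.

```python
import os


def open_with_full_path(name, mode="r"):
    """Open `name`, re-raising OS errors with the absolute path included."""
    # abspath resolves a relative name against the current working directory.
    full_path = os.path.abspath(name)
    try:
        return open(full_path, mode)
    except OSError as err:
        # Keep the original errno and exception type, but report the
        # absolute path instead of the bare relative name.
        raise type(err)(err.errno, err.strerror, full_path) from err
```

With an error message built this way, a report from a failing worker shows which directory the process was running in, rather than only the filename.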
Comment 12•8 months ago
(In reply to Pete Moore [:pmoore][:pete] from comment #11)
> (In reply to Andrew Erickson [:aerickson] from comment #10)
> > Can you make that error give the full path? Not sure what's going on.
> Sure. That is a great idea. I've created https://github.com/taskcluster/taskcluster/pull/6680.
Hey Andy, do you want to try again? It looks like this got released in v58.0.0.
Comment 13•8 months ago (Reporter)
From last week:
Worker Manager has encountered an error while trying to provision the worker pool translations-1/b-linux-v100-gpu:
open /tasks-resolved-count.txt: permission denied
ErrorId: 472iZAFeTFCk_h2ItTSpoQ
It includes the extra information:
GOARCH: amd64
GOOS: linux
cleanUpTaskDirs: 'true'
deploymentId: ''
engine: simple
gwRevision: f5527730e1a548eac1e79871e2eb0b7243d1868a
gwVersion: 60.4.0
instanceType: projects/887720501152/machineTypes/n1-highmem-8
provisionerId: translations-1
rootURL: https://firefox-ci-tc.services.mozilla.com
workerGroup: us-central1
workerId: '6347263257659500946'
workerType: b-linux-v100-gpu
g-w isn't running as root on this pool AIUI, so it can't write to /.
Comment 14•8 months ago
That suggests the working directory of worker runner is /. The working directory should be a location that the process has permission to write to.
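One way to confirm this kind of diagnosis on a worker is to check whether the process can create a file in its current working directory. A minimal Python sketch, assuming it is run as the same user and from the same directory as the worker process; `can_create_files_in` is a hypothetical helper, not part of any Taskcluster tool.

```python
import os
import tempfile


def can_create_files_in(directory):
    """Return True if this process can actually create a file in `directory`."""
    try:
        # Creating (and automatically deleting) a real file exercises the
        # same permission that writing tasks-resolved-count.txt would need.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    cwd = os.getcwd()
    print(f"working directory: {cwd}")
    print(f"writable: {can_create_files_in(cwd)}")
```

Run from / as a non-root user, a check like this would be expected to report that the directory is not writable.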
Comment 15•8 months ago
Note, we shouldn't be running simple/insecure engine in production. Are there bugs on file for the issues which prevent us from running multiuser instead? We should probably set up a tracking bug, in case there is more than one.
Comment 16•8 months ago (Reporter)
Comment 17•8 months ago (Assignee)
My PR didn't work, perhaps because it used 'WorkingDir' instead of 'WorkingDirectory'. I'll try another build.
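For illustration, a small Python sketch of how a misspelled key could go unnoticed. As far as I know, systemd logs a warning for keys it does not recognize and otherwise ignores them, so a unit containing `WorkingDir=` would still start without any working directory being set. The unit text, its placeholder paths, and the helper `working_directory_of` are all hypothetical; the real unit in the image may look different.

```python
import configparser

# Hypothetical unit text with placeholder paths, for illustration only.
UNIT_TEXT = """\
[Service]
ExecStart=/path/to/start-worker /path/to/config.yml
WorkingDir=/path/to/writable/dir
"""


def working_directory_of(unit_text):
    """Return the [Service] WorkingDirectory value, or None if it is not set."""
    parser = configparser.ConfigParser(
        strict=False,        # unit files may legitimately repeat keys
        interpolation=None,  # treat '%' in values literally
        delimiters=("=",),
    )
    parser.optionxform = str  # preserve key case instead of lowercasing
    parser.read_string(unit_text)
    if not parser.has_section("Service"):
        return None
    return parser.get("Service", "WorkingDirectory", fallback=None)


if __name__ == "__main__":
    # The misspelled key is present, but the real directive is not.
    print(working_directory_of(UNIT_TEXT))
```

A check along these lines could be run against the generated unit file while building an image, before booting a worker to find out.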
These images were created before the GCP/Ubuntu kernel fix, so it was not possible to use multiuser due to it requiring a GUI. We would also prefer to avoid the overhead of a GUI as it's not needed. I believe we already have a bug to remove the requirement for a GUI with multiuser.
Comment 18•4 months ago (Assignee)
I built a new image with my fix in https://github.com/mozilla-platform-ops/monopacker/pull/131.
I think Taskcluster has pushed a change that sets this value if not present, so perhaps a fix is not needed any longer.