Open Bug 1859323 Opened 1 year ago Updated 4 months ago

generic-worker: open tasks-resolved-count.txt: permission denied

Categories

(Infrastructure & Operations :: RelOps: Posix OS, defect)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: jcristau, Assigned: aerickson)

References

Details

(Whiteboard: [relops-linux])

Attachments

(1 file)

We're getting error emails from generic-worker on the translations-1/t-linux-v100-gpu pool that say "open tasks-resolved-count.txt: permission denied".
I'm not sure where that comes from: is it a problem with the image, with g-w itself, or with the use of the simple engine?

Component: Workers → RelOps: Posix OS
Product: Taskcluster → Infrastructure & Operations

My guess is that generic-worker was run as one user when it created the file tasks-resolved-count.txt and is now running as another user, which doesn't have read access to the file. Moving to RelOps as they manage this worker pool.

@aerickson, can you take a look at the permissions/ownership of that file, and see which user the generic-worker is running as? Thanks!
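
For anyone wanting to script the ownership check requested above, here is a minimal Go sketch (Linux/Unix only). It compares the owner of a file with the effective user of the current process. The helper names ownerMismatch and demo are made up for illustration and are not part of generic-worker; demo only exercises the check against a freshly created temp file.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// ownerMismatch reports whether the file at path is owned by a different
// user than the effective user of this process (Unix-like systems only).
func ownerMismatch(path string) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return false, fmt.Errorf("no Unix ownership data for %s", path)
	}
	return int(st.Uid) != os.Geteuid(), nil
}

// demo (hypothetical) creates a file in a temp directory and runs the check
// on it; a file we just created should be owned by the current user.
func demo() (bool, error) {
	dir, err := os.MkdirTemp("", "owner-check")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "tasks-resolved-count.txt")
	if err := os.WriteFile(path, []byte("0"), 0o644); err != nil {
		return false, err
	}
	return ownerMismatch(path)
}

func main() {
	mismatch, err := demo()
	if err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("owner differs from current user:", mismatch)
}
```

On a real worker, ownerMismatch would be pointed at the actual tasks-resolved-count.txt and run as the user that runs generic-worker.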

Flags: needinfo?(aerickson)

I'm getting similar errors for my test pool.

Worker Manager has encountered an error while trying to provision the worker pool translations-1/b-linux-aerickson-test:


open tasks-resolved-count.txt: permission denied

ErrorId: 5pdJVjvmRViwvr0j5R6eSA

It includes the extra information:

GOARCH: amd64
GOOS: linux
cleanUpTaskDirs: 'true'
deploymentId: ''
engine: simple
gwRevision: fa6cdbdf9a2aa6616cb6997e5f91e14a33685c87
gwVersion: 48.1.0
instanceType: projects/887720501152/machineTypes/n2-standard-2
provisionerId: translations-1
rootURL: https://firefox-ci-tc.services.mozilla.com
workerGroup: us-central1
workerId: '7765874583162361319'
workerType: b-linux-aerickson-test

@pmoore, these images are built in monopacker. This is happening during provisioning: g-w has never run on these instances, so how could this file exist? Does the same error get emitted if the file is not present? And what path does g-w try to create this file at? It would be helpful to have that in the error as well.

Flags: needinfo?(aerickson) → needinfo?(pmoore)

Ah, interesting. It tries to create the file in the working directory of the generic-worker process.
It looks like Worker Runner doesn't explicitly set a working directory for generic-worker, so it should inherit the working directory of the start-worker process.

In the systemd Service configuration I don't see an explicit WorkingDirectory setting. That probably should be set to a writable path for the user that runs generic-worker.
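
To illustrate the behaviour described above, here is a minimal Go sketch (not generic-worker's actual code) showing that a file opened by a relative name lands in whatever working directory the process inherited. The helpers writeCounter and demo are hypothetical names used only for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeCounter writes count to a relative file name, so the file is created
// in the current working directory of the process.
func writeCounter(name string, count int) error {
	return os.WriteFile(name, []byte(fmt.Sprintf("%d", count)), 0o644)
}

// demo (hypothetical) changes into a fresh temp directory, writes the file by
// its relative name, and reports whether it appeared under that directory.
func demo() (bool, error) {
	dir, err := os.MkdirTemp("", "gw-cwd-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	old, err := os.Getwd()
	if err != nil {
		return false, err
	}
	if err := os.Chdir(dir); err != nil {
		return false, err
	}
	defer os.Chdir(old) // restore the original working directory

	if err := writeCounter("tasks-resolved-count.txt", 1); err != nil {
		return false, err
	}
	_, err = os.Stat(filepath.Join(dir, "tasks-resolved-count.txt"))
	return err == nil, err
}

func main() {
	ok, err := demo()
	fmt.Println("file created in working directory:", ok, "error:", err)
}
```

If the inherited working directory is one the worker user cannot write to, the same relative open fails with "permission denied", which would match the error in this bug.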

Flags: needinfo?(pmoore)
Assignee: nobody → aerickson
Status: NEW → ASSIGNED
Whiteboard: [relops-linux]

Thanks :pmoore. I've created a fixed image and will test it out.

:jcristau, The image mentioned in the phab above should fix the systemd issue (it also updates the TC components). I'll run some sanity tests.

Pushed by aerickson@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/0b0246c360bc translations-testing: use latest image r=relsre-reviewers,markco

(In reply to Andrew Erickson [:aerickson] from comment #8)
> I've started a try push to test this out.
>
> https://treeherder.mozilla.org/jobs?repo=try&revision=385b80206da496b84e82c4d7b5269113e4c02e92

Wrong link. I don't have a try command to test out this pool (TODO: work with releng to find one). I've run a basic TC 'create task' that sleeps 100 and it looks good (https://firefox-ci-tc.services.mozilla.com/tasks/bJCbiV2gQjm_TkvXzRU5Hw/runs/0).

:jcristau, will you please try it out in your test queue (and if it looks good use it in prod)?

Flags: needinfo?(jcristau)

:pmoore, it doesn't seem to have fixed the issue (PR is https://github.com/mozilla-platform-ops/monopacker/pull/115). I noticed I received this on my test pool (I know it's the new image because I upgraded to 57.0.1):

Worker Manager has encountered an error while trying to provision the worker pool translations-1/b-linux-aerickson-test:


open tasks-resolved-count.txt: permission denied

ErrorId: ee28-1x9TrG91q_e22P5cw

It includes the extra information:

GOARCH: amd64
GOOS: linux
cleanUpTaskDirs: 'true'
deploymentId: ''
engine: simple
gwRevision: fe45895655139035aac1083981d4ecf7b78cfb2e
gwVersion: 57.0.1
instanceType: projects/887720501152/machineTypes/n2-standard-2
provisionerId: translations-1
rootURL: https://firefox-ci-tc.services.mozilla.com
workerGroup: us-central1
workerId: '5767626759346923236'
workerType: b-linux-aerickson-test

Can you make that error give the full path? Not sure what's going on.

Flags: needinfo?(jcristau) → needinfo?(pmoore)

(In reply to Andrew Erickson [:aerickson] from comment #10)
> Can you make that error give the full path? Not sure what's going on.

Sure. That is a great idea. I've created https://github.com/taskcluster/taskcluster/pull/6680.
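
For context, a minimal Go sketch of one way to get the full path into such an error. Go's os.OpenFile reports the path exactly as it was passed in, so a relative name produces a relative path in the message; resolving the name with filepath.Abs first makes the error carry the absolute path. openWithFullPath is a hypothetical helper for illustration and is not necessarily what the PR above does.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// openWithFullPath resolves name against the working directory before opening
// it, so that any resulting *os.PathError names the absolute location.
func openWithFullPath(name string) (*os.File, error) {
	abs, err := filepath.Abs(name)
	if err != nil {
		return nil, fmt.Errorf("resolving %q: %w", name, err)
	}
	return os.OpenFile(abs, os.O_RDWR|os.O_CREATE, 0o644)
}

func main() {
	// The parent directory is assumed not to exist, so the open fails no
	// matter which user runs this, and the error shows the absolute path.
	_, err := openWithFullPath(filepath.Join("no-such-dir-1859323", "tasks-resolved-count.txt"))
	fmt.Println(err)
}
```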

Flags: needinfo?(pmoore)

(In reply to Pete Moore [:pmoore][:pete] from comment #11)
> (In reply to Andrew Erickson [:aerickson] from comment #10)
> > Can you make that error give the full path? Not sure what's going on.
>
> Sure. That is a great idea. I've created https://github.com/taskcluster/taskcluster/pull/6680.

Hey Andy, do you want to try again? It looks like this got released in v58.0.0.

Flags: needinfo?(aerickson)

From last week:

Worker Manager has encountered an error while trying to provision the worker pool translations-1/b-linux-v100-gpu:


open /tasks-resolved-count.txt: permission denied

ErrorId: 472iZAFeTFCk_h2ItTSpoQ

It includes the extra information:

GOARCH: amd64
GOOS: linux
cleanUpTaskDirs: 'true'
deploymentId: ''
engine: simple
gwRevision: f5527730e1a548eac1e79871e2eb0b7243d1868a
gwVersion: 60.4.0
instanceType: projects/887720501152/machineTypes/n1-highmem-8
provisionerId: translations-1
rootURL: https://firefox-ci-tc.services.mozilla.com
workerGroup: us-central1
workerId: '6347263257659500946'
workerType: b-linux-v100-gpu

g-w isn't running as root on this pool AIUI, so it can't write to /.

That suggests the working directory of worker runner is /. That should be a location that the process has permission to write to.
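
A quick way to confirm this on an instance is to probe whether the working directory is writable for the current user. Below is a minimal Go sketch under that assumption; checkWritable is a hypothetical helper, not something generic-worker or worker runner provides.

```go
package main

import (
	"fmt"
	"os"
)

// checkWritable verifies that the process can create files in dir by creating
// a temporary file there and removing it again.
func checkWritable(dir string) error {
	f, err := os.CreateTemp(dir, ".write-check-*")
	if err != nil {
		return fmt.Errorf("directory %q is not writable: %w", dir, err)
	}
	name := f.Name()
	f.Close()
	return os.Remove(name)
}

func main() {
	wd, err := os.Getwd()
	if err != nil {
		fmt.Println("cannot determine working directory:", err)
		os.Exit(1)
	}
	if err := checkWritable(wd); err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	fmt.Println("working directory is writable:", wd)
}
```

Run as the non-root worker user with / as the working directory, this would be expected to fail in the same way as the open of /tasks-resolved-count.txt.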

Note, we shouldn't be running the simple (insecure) engine in production. Are there bugs on file for the issues that prevent us from running multiuser instead? We should probably set up a tracking bug in case there is more than one.

My PR didn't work, perhaps because it used 'WorkingDir' instead of 'WorkingDirectory'. I'll try another build.
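
A misspelled key like this is easy to miss, since systemd ignores unknown keys with only a logged warning. As a rough guard, here is a minimal Go sketch that checks whether a unit file's [Service] section actually sets WorkingDirectory=. It is a deliberately simplified parser (no drop-ins, no line continuations), serviceWorkingDirectory is a hypothetical helper, and the unit text and paths in main are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// serviceWorkingDirectory returns the value of WorkingDirectory= from the
// [Service] section of the given unit text, and whether the key was found.
// If the key appears more than once, the last assignment wins.
func serviceWorkingDirectory(unit string) (string, bool) {
	section := ""
	value, found := "", false
	sc := bufio.NewScanner(strings.NewReader(unit))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue // skip blank lines and comments
		}
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			section = line
			continue
		}
		key, val, ok := strings.Cut(line, "=")
		if ok && section == "[Service]" && strings.TrimSpace(key) == "WorkingDirectory" {
			value, found = strings.TrimSpace(val), true
		}
	}
	return value, found
}

func main() {
	// Hypothetical unit text containing the misspelled key.
	unit := "[Unit]\nDescription=example worker\n\n[Service]\nWorkingDir=/home/worker\nExecStart=/usr/local/bin/start-worker\n"
	dir, ok := serviceWorkingDirectory(unit)
	fmt.Printf("WorkingDirectory set: %v (value %q)\n", ok, dir)
}
```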

These images were created before the GCP/Ubuntu kernel fix, so it was not possible to use multiuser because it required a GUI. We would also prefer to avoid the overhead of a GUI, as it's not needed. I believe we already have a bug to remove the requirement for a GUI with multiuser.

Flags: needinfo?(aerickson)

I built a new image with my fix in https://github.com/mozilla-platform-ops/monopacker/pull/131.

I think Taskcluster has pushed a change that sets this value if not present, so perhaps a fix is not needed any longer.
