Closed Bug 1171809 Opened 7 years ago Closed 7 years ago

docker-worker: Listen to pulse exchange to blow away (clobber) named caches (proposal)

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jonasfj, Assigned: wcosta)

Attachments

(1 file)

53 bytes, text/x-github-pull-request
garndt: review+
Sheriffs want a clobber service... Not sure exactly what form it should take yet.

But maybe it's just that docker-worker listens to a pulse exchange and then gets a message routed with <provisionerId>.<workerType>
which tells it which named cache to clear.
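
A minimal sketch of the routing idea, assuming the key is exactly `<provisionerId>.<workerType>` as proposed above. The function names and the example identifiers are hypothetical, and nothing here is an existing docker-worker or pulse API:

```python
def routing_key(provisioner_id: str, worker_type: str) -> str:
    """Build the '<provisionerId>.<workerType>' key proposed in this comment."""
    return f"{provisioner_id}.{worker_type}"


def is_for_this_worker(key: str, provisioner_id: str, worker_type: str) -> bool:
    """Return True when a message's routing key addresses this worker's type."""
    return key == routing_key(provisioner_id, worker_type)


# Hypothetical identifiers, for illustration only.
print(is_for_this_worker("some-provisioner.some-worker-type",
                         "some-provisioner", "some-worker-type"))
```

A worker would bind its queue with this key, so it only receives purge requests addressed to its own provisionerId and workerType.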

I suspect the idea of some form of named cache might persist between different
worker implementations, so a generic way to signal cache purge might not be bad.
I partially think that us having cache poisoning issues means we cache too much.

@garndt, what do you think?
(I'm not sure this isn't an anti-pattern)
Flags: needinfo?(garndt)
I think a combination of using per-branch caches and clobbering caches is a good step toward reducing the issues we've seen in things like bug 1154669.

I also want to implement some disk stats soon on these workers to understand the impact of having more cache directories, but that's independent of this bug.

The worker could listen to that exchange, and then either remove it completely from volumes that can be used for tasks, or if it's currently in use, it can mark it dirty and then once it's freed it'll be removed.
Flags: needinfo?(garndt)
Does this bug cover adding Taskcluster jobs to the Clobberer as well so this infra is actually usable? Right now, those jobs aren't listed, so we can't clobber them even if it's in theory supported.
https://api.pub.build.mozilla.org/clobberer
I don't know the clobberer service, but judging from the docs:
https://api.pub.build.mozilla.org/docs/usage/clobberer/

I would say that it wouldn't integrate well.
But it could probably be something similar.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #3)
> Does this bug cover adding Taskcluster jobs to the Clobberer as well so this
> infra is actually usable? Right now, those jobs aren't listed, so we can't
> clobber them even if it's in theory supported.
> https://api.pub.build.mozilla.org/clobberer

That depends on the approach they decide to take. I'm not sure if they'd want to use clobberer (the utility) here or not. If we'd like to keep clobbering tied to a single interface, that would be the way to do it; but that approach may seem limiting within the context of TC.
I can say with certainty that sheriffs want to be able to clobber TC build slaves the same as buildbot slaves (as in, forced objdir deletion). If we need a new bug to track it, so be it.
@RyanVM,
Can you describe the current workflow for purging caches?
Which button do you click, and where? Which link? What details do you enter?
We discussed at length on IRC and I think we're all on the same page now. This bug tracks TC being made capable of performing a clobber task and I've filed bug 1174263 for updating the clobberer tool to properly communicate to TC that a clobber has been requested for a given tree/platform.
Depends on: 1174846
I've deployed a taskcluster-purge-cache service that will publish a pulse message.

Once DNS is up, I'll publish docs, add it to docs.tc.net and build a quick tool for purging caches per workerType.
Docs deployed, will be updated when DNS is configured: http://docs.taskcluster.net/services/purge-cache/
Thanks Jonas!

Has someone used this? Is it working as expected?
Looking over the docs, it appears that this is what needs to be implemented within docker-worker:

1. Listen for events on exchange/taskcluster-purge-cache/v1/purge-cache
2. On event, if workerType and provisionerId match, then:
  a. If the named cache exists and is not mounted, remove it
  b. If the cache is currently mounted, mark it as purged and remove it when the volume is released
  c. Never allow tasks to volume mount a cache marked for purging
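
The steps above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the docker-worker implementation: the `CacheRegistry` class and its methods are hypothetical, the message field names (`provisionerId`, `workerType`, `cacheName`) are assumed rather than taken from the purge-cache docs, and appending to `removed` stands in for deleting the volume on disk.

```python
from dataclasses import dataclass, field


@dataclass
class CacheRegistry:
    """Hypothetical model of one worker's named caches."""
    provisioner_id: str
    worker_type: str
    existing: set = field(default_factory=set)       # caches present on disk
    mounted: dict = field(default_factory=dict)      # cache name -> tasks using it
    pending_purge: set = field(default_factory=set)  # purge requested while mounted
    removed: list = field(default_factory=list)      # stand-in for deleting volumes

    def _remove(self, name: str) -> None:
        self.existing.discard(name)
        self.pending_purge.discard(name)
        self.removed.append(name)

    def acquire(self, name: str) -> bool:
        """Mount a cache for a task; refuse if it is marked for purging (step c)."""
        if name in self.pending_purge:
            return False
        self.existing.add(name)
        self.mounted[name] = self.mounted.get(name, 0) + 1
        return True

    def release(self, name: str) -> None:
        """Unmount a cache; remove it if a purge arrived while it was in use."""
        remaining = self.mounted.get(name, 0) - 1
        if remaining > 0:
            self.mounted[name] = remaining
            return
        self.mounted.pop(name, None)
        if name in self.pending_purge:
            self._remove(name)

    def handle_purge(self, message: dict) -> None:
        """Apply one purge message (field names are assumptions)."""
        target = (message.get("provisionerId"), message.get("workerType"))
        if target != (self.provisioner_id, self.worker_type):
            return  # addressed to a different worker type
        name = message.get("cacheName")
        if name not in self.existing:
            return  # nothing to purge
        if self.mounted.get(name, 0) > 0:
            self.pending_purge.add(name)  # step b: defer until released
        else:
            self._remove(name)            # step a: remove immediately
```

Once a deferred purge completes on release, the name leaves `pending_purge`, so a later task can mount a fresh cache under the same name.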
Component: TaskCluster → Docker-Worker
Product: Testing → Taskcluster
This might also be enough that we don't need to do bug 1151605, which is probably a more in-tree-driven approach to fixing this.
See Also: → 1151605
Assignee: nobody → wcosta
Summary: docker-worker: Listen to pulse exchange to blow away named caches (proposal) → docker-worker: Listen to pulse exchange to blow away (clobber) named caches (proposal)
Status: NEW → ASSIGNED
Attached file PR 72
Attachment #8667447 - Flags: review?(garndt)
Comment on attachment 8667447 [details] [review]
PR 72

This is definitely coming together.  Awesome work picking this up so quickly.  Docker worker definitely isn't the easiest to get started with.  I left some comments on the PR and the CI tests have a couple of failures that need addressing.  Feel free to reflag me once those are addressed.  Thanks!
Attachment #8667447 - Flags: review?(garndt) → review-
Comment on attachment 8667447 [details] [review]
PR 72

All comments addressed.
Attachment #8667447 - Flags: review- → review?(garndt)
Even before the patch, not all tests pass [1]. Running tests under worker-ci locally, the purge cache tests pass if I supply the right pulse credentials. I have no idea why they are failing when pushed to the PR.

[1] https://pastebin.mozilla.org/8848314
Flags: needinfo?(garndt)
It's possible that a lot of those failures are fallout from me rotating creds this morning.  I have restarted the app along with auth...could you try again?
Flags: needinfo?(garndt)
Comment on attachment 8667447 [details] [review]
PR 72

Nice work on this.  Comments have been addressed in the PR.
Attachment #8667447 - Flags: review?(garndt) → review+
https://github.com/taskcluster/docker-worker/commit/3099b91c3d8dd6fc42c82c8bf3b312959ad89bef
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Docker-Worker → Workers