Make use of scope-protected caches in Windows builds

(Reporter: pmoore, Assigned: pmoore)


We currently have caches in our Windows builds, but they are not protected by scopes: any task that runs on a given worker could potentially poison a cache. They are currently regular folders at a global location on the filesystem, readable and writable by all users.

The docs for mounts/caches in generic-worker can currently be found here:
  * https://docs.taskcluster.net/reference/workers/generic-worker/payload

A real-world example of a generic-worker scope-protected, non-preloaded cache is here:
  * https://github.com/taskcluster/taskcluster-worker/blob/319fa133caa43f78368f77dff02a6e8da1152faa/.taskcluster.yml#L106-L107

The above example is from the taskcluster-worker CI, which currently runs under generic-worker (i.e. this is a generic-worker feature not a taskcluster-worker feature).

It looks like e.g. vcs caching is handled in gecko task generation here, but currently only supports docker worker:
  * https://hg.mozilla.org/mozilla-central/file/9577ddeaafd8/taskcluster/taskgraph/transforms/job/common.py#l103

There is probably also something similar for object caches.

If we can get the gecko task generation to take advantage of generic-worker caches for object caches and vcs caches, we should gain more task transparency (more defined in the task, less defined in the AMI), which also improves portability and reusability of worker types, as well as increasing security by having caches protected by scopes.
Is the action here to implement scope protection, or to use the caches in the in-tree taskgraph generation even though they are not scope-protected?
Flags: needinfo?(pmoore)
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> Is the action here to implement scope protection, or to use the caches in
> the in-tree taskgraph generation even though they are not scope-protected?

The in-tree task generation of gecko should generate generic-worker build tasks that mount writable vcs cache and object directory caches via the "mounts" pragma of the task payload. Use of writable cache <cache-name> necessarily requires scope generic-worker:cache:<cache-name>, which provides fine-grained access control.
Flags: needinfo?(pmoore)
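The cache/scope pairing described above can be sketched in Python. This is an illustrative fragment, not a real task definition: the cache name "gecko-checkouts" is invented, and only the "mounts" portion of the payload is shown.

```python
# Sketch of a generic-worker task payload mounting a writable cache via the
# "mounts" pragma. The cache name here is hypothetical.
payload = {
    "mounts": [
        {
            "cacheName": "gecko-checkouts",  # hypothetical cache name
            "directory": "hg-shared",        # relative to the task directory
        }
    ],
}

def required_cache_scopes(payload):
    """Scopes a task needs for its writable cache mounts: use of writable
    cache <cache-name> requires scope generic-worker:cache:<cache-name>."""
    return [
        "generic-worker:cache:" + mount["cacheName"]
        for mount in payload.get("mounts", [])
        if "cacheName" in mount
    ]

print(required_cache_scopes(payload))
```

A task declaring this mount without the corresponding scope would be rejected by the worker, which is what provides the fine-grained access control.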
Ah, I misunderstood comment 0 to say that generic-worker didn't support scope-protected caches.  It does, but we're not using them in-tree.  Thanks!
Mentor: dustin
Component: General Automation → Task Configuration
Keywords: good-first-bug
Product: Release Engineering → Taskcluster
QA Contact: catlee
Assignee: nobody → pmoore
I'm happy to take a stab at this, with my new knowledge of the in-tree task generation process from bug 1349980.

I'll be on PTO for a week after today, so this won't get done until later in the month, as I first need to sort out rebooting in generic-worker on OS X in scl3 before I leave.
Found in triage.

Pete: is this still on your radar, or should it go back in the pool?
Flags: needinfo?(pmoore)
Yeah, it should probably go back in the pool! Thanks for following up!

I believe Windows builds still don't use worker caches. :/

Not sure how much quicker they would be if they did. It might not be worth it, for the pain that dirty caches can cause.

Maybe we should assign this to the build team?
Assignee: pmoore → nobody
Flags: needinfo?(pmoore) → needinfo?(coop)
This is largely assigned to the build team -- this component is part of a Build submodule.

Caching tends to markedly decrease build times, so if we're not doing caching at all right now, adding it is probably a big win.
Mentor: dustin
Flags: needinfo?(coop)
I think I need to look at this.
Flags: needinfo?(gps)
(In reply to Pete Moore [:pmoore][:pete] from comment #0)
> The docs for mounts/caches in generic-worker can currently be found here:
>   * https://docs.taskcluster.net/reference/workers/generic-worker/payload

This view now seems to be empty - perhaps related to bug 1440028.

In the meantime, you can see the raw payload jsonschema here:

  * http://schemas.taskcluster.net/generic-worker/v1/payload.json

It isn't as pretty as when it is htmlified - but it should be enough to make it clear how to mount caches.

(In reply to Gregory Szorc [:gps] from comment #8)
> I think I need to look at this.

Awesome, thanks Greg!

Product: TaskCluster → Firefox Build System
Note, we had another tree closure due to bug 1417881 today, with more poisoned caches, which I believe would be solved by this bug.
Do you have an ETA for looking into this Greg?
I started poking at this yesterday. I have some patches in the works. Hopefully I'll get something up for review by end of day.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Flags: needinfo?(gps)
(In reply to Pete Moore [:pmoore][:pete] from comment #9)
> (In reply to Pete Moore [:pmoore][:pete] from comment #0)
> > The docs for mounts/caches in generic-worker can curently be found here:
> >   * https://docs.taskcluster.net/reference/workers/generic-worker/payload
> 
> This view now seems to be empty - perhaps related to bug 1440028.

This has since been fixed, so docs are available again here:

https://docs.taskcluster.net/reference/workers/generic-worker/docs/payload

Depends on: 1436037
Before it pages out of my brain over the weekend, here's where I got yesterday...

Setting up caches for the VCS checkout on Windows workers will be a bit of work.

The reason is that each task gets its own (temporary) directory for the "workspace" for that task (e.g. z:\task_1234567). And because sccache needs the absolute paths within builds to remain consistent across builds in order to achieve cache hits, we symlink z:\build to the per-task directory. An existing z:\build is blown away.

Each task is also run as a new user. So any files shared between tasks must have ACLs set that allow all future users to access/modify/etc those files.

So in order to leverage a VCS checkout cache, we need to create a new place outside of the per-task directories to hold that cache. Then we'll need to symlink that cache into a consistent path (e.g. z:\build\...) so paths are consistent across tasks.

Sometime yesterday I had convinced myself that by the time I hacked up taskgraph to do all the VCS checkout and cache maintenance, symlink muckery, etc that is required to implement this properly, I would have reinvented significant parts of run-task. We have long wanted to leverage run-task outside of Docker tasks. So I started hacking on getting run-task to work on Windows/generic-worker. That's why bug 1436037 is now a blocker to this one.

My intent is to start using run-task on *all* generic-worker tasks (or maybe just the Windows ones for now). Then we can move the VCS logic for Windows out of taskgraph and into run-task, making all the VCS bits consistent across worker types. It will also allow us to move some low-level Windows batch scripting out of taskgraph and into run-task.
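The per-task workspace and symlink dance described above can be sketched roughly as follows. The helper name and paths are illustrative; on POSIX this runs as-is, while on Windows creating the symlink requires appropriate privileges.

```python
import os
import shutil

def relink_build_dir(task_dir, build_link):
    """Point a fixed path (e.g. z:\\build) at the current task's temporary
    workspace (e.g. z:\\task_1234567), so absolute build paths stay identical
    across tasks and sccache can get cache hits. Any existing link or
    directory at the fixed path is blown away first."""
    if os.path.islink(build_link):
        os.remove(build_link)       # drop the previous task's symlink
    elif os.path.isdir(build_link):
        shutil.rmtree(build_link)   # blow away a stale real directory
    os.symlink(task_dir, build_link, target_is_directory=True)
```

Anything the build system writes under the fixed path actually lands in the per-task directory, which is what makes per-task cleanup cheap while keeping paths stable.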
(In reply to Gregory Szorc [:gps] from comment #14)
> Before it pages out of my brain over the weekend, here's where I got
> yesterday...
> 
> Setting up caches for the VCS checkout on Windows workers will be a bit of
> work.
> 
> The reason is that each task gets its own (temporary) directory for the
> "workspace" for that task (e.g. z:\task_1234567). And because sccache needs
> the absolute paths within builds to remain consistent across builds in order
> to achieve cache hits, we symlink z:\build to the per-task directory. An
> existing z:\build is blown away.

Yes, that is what we are currently doing indeed.

> Each task is also run as a new user. So any files shared between tasks must
> have ACLs set that allow all future users to access/modify/etc those files.

This is taken care of by the worker - you just need to declare a writable cache directory, and the worker will take care of setting ACLs so that the new task user has full control of it.

> So in order to leverage a VCS checkout cache, we need to create a new place
> outside of the per-task directories to hold that cache. Then we'll need to
> symlink that cache into a consistent path (e.g. z:\build\...) so paths are
> consistent across tasks.

I don't think so. When you mount it at e.g. hg-shared, it will appear under Z:\build\hg-shared which will remain a fixed location as far as the build system is concerned. Internally it would be mounted at Z:\task_XXX\hg-shared, but due to the symlinking, it will be visible to the build system at Z:\build\hg-shared.

> Sometime yesterday I had convinced myself that by the time I hacked up
> taskgraph to do all the VCS checkout and cache maintenance, symlink muckery,
> etc that is required to implement this properly, I would have reinvented
> significant parts of run-task. We have long wanted to leverage run-task
> outside of Docker tasks. So I started hacking on getting run-task to work on
> Windows/generic-worker. That's why bug 1436037 is now a blocker to this one.

As above, I believe all these pieces that you mention are already taken care of. I don't think you have to do any of them yourself.

> My intent is to start using run-task on *all* generic-worker tasks (or maybe
> just the Windows ones for now). Then we can move the VCS logic for Windows
> out of taskgraph and into run-task, making all the VCS bits consistent
> across worker types. It will also allow us to move some low-level Windows
> batch scripting out of taskgraph and into run-task.

That sounds awesome too.

Thanks Greg!
Flags: needinfo?(gps)
I really think all you need is to add this to the build task payload:

      mounts:
        - cacheName: hg-shared-gecko (feel free to choose a different name)
          directory: hg-shared

And then in the build system, wherever you currently use C:\hg-shared, use Z:\build\hg-shared instead.

All ACLs, mounting, persisting of cache between tasks etc, will all be taken care of for you.
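In taskgraph terms, the change this implies is small. A hypothetical helper (names and field layout invented for illustration, mirroring the payload snippet above rather than the real in-tree transform) might look like:

```python
def add_hg_shared_cache(task, cache_name="hg-shared-gecko"):
    """Hypothetical sketch of a taskgraph-style transform step: declare a
    writable hg-shared cache mount on a build task and add the scope that
    generic-worker requires for it."""
    mounts = task.setdefault("payload", {}).setdefault("mounts", [])
    mounts.append({
        "cacheName": cache_name,
        "directory": "hg-shared",
    })
    task.setdefault("scopes", []).append(
        "generic-worker:cache:" + cache_name
    )
    return task
```

The worker takes care of everything else, so the transform only has to emit the mount declaration and keep the scope in sync with the cache name.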
(In reply to Pete Moore [:pmoore][:pete] from comment #16)

> All ACLs, mounting, persisting of cache between tasks etc, will all be taken
> care of for you.

https://github.com/taskcluster/generic-worker/blob/v10.7.8/mounts.go#L393-L444
https://github.com/taskcluster/generic-worker/blob/v10.7.8/plat_windows.go#L510-L519
Thank you very much for all that context, Pete! It is very helpful! It sounds like I'll be able to reduce scope bloat in this bug due to generic-worker supporting nice things. I'll definitely have to tweak my patches a bit.
Flags: needinfo?(gps)
You're welcome! :-)
I still intend to hack on this. I just got sucked into the rabbit hole of bug 1460777. In my mind, all these problems around early task activity are very intertwined. The patches I had written for this bug also serve to benefit from Pete's excellent work in bug 1459375 (I was using the "mounts" feature to install run-task in the worker).

Depends on: 1462791
I've taken a stab at this in https://treeherder.mozilla.org/#/jobs?repo=try&revision=b56d4a14f26df7759e3d1335380aa07a8f3c5d97

I expect it might take several iterations to get right, but wanted to share the link so you know I'm having a go at it.
Created attachment 8984050 [details] [diff] [review]
bug1350956_gecko_v1.patch

Hi Greg,

This patch uses a worker cache for hg-shared for builds, but I saw that e.g. spidermonkey tasks (and some other tasks) also use y:\hg-shared. This initial patch doesn't migrate those tasks, only builds.

Also this patch only tackles using a cache for hg-shared, it doesn't introduce worker caches for e.g. object directory caches. We should probably do that too. The goal in the end should be that all caches used by tasks are worker caches, rather than folders on the host environment with global read/write for all users.

But I figured this is a good place to start. :-)

Thanks.
Assignee: gps → pmoore
Attachment #8984050 - Flags: review?(gps)
Comment on attachment 8984050 [details] [diff] [review]
bug1350956_gecko_v1.patch

Review of attachment 8984050 [details] [diff] [review]:
-----------------------------------------------------------------

This seems like a step in the right direction.

FWIW, I don't think there is a substantial difference between y: and z: for the cache location. I think these paths could be consolidated and standardized. But this is follow-up fodder.
Attachment #8984050 - Flags: review?(gps) → review+
(In reply to Gregory Szorc [:gps] from comment #24)
> Comment on attachment 8984050 [details] [diff] [review]
> bug1350956_gecko_v1.patch
> 
> Review of attachment 8984050 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This seems like a step in the right direction.
> 
> FWIW, I don't think there is a substantial difference between y: and z: for
> the cache location. I think these paths could be consolidated and
> standardized. But this is follow-up fodder.

There could be a significant difference performance-wise; currently, on these gecko worker types, task folders are on the Z: drive, and caches are on the Y: drive. That means every time a cache is persisted after a task completes, it needs to be moved from the Z: drive to the Y: drive, which implies COPY/DELETE semantics for the entire folder hierarchy, rather than a rename of the parent folder, which would be almost instantaneous if they were on the same drive.

IIRC, the split across Y: and Z: was so that Z: drive could be formatted between runs, as this was considered more robust at the time this was implemented. I suspect this is no longer the case, and performance overall could be improved if tasks and caches were on the same drive, and we didn't format drives between tasks. The only way to know for sure would be to collect some data and compare results.
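The rename-vs-copy distinction can be sketched in Python. This is an illustrative helper, not worker code: `os.rename` fails with EXDEV when source and destination are on different drives/filesystems, which forces the expensive copy-and-delete path.

```python
import errno
import os
import shutil

def persist_cache(task_cache_dir, cache_store_dir):
    """Move a cache out of the task folder into the persistent store.
    On the same drive this is a near-instant rename of the parent folder;
    across drives os.rename fails and we fall back to copying and then
    deleting the entire hierarchy (the COPY/DELETE semantics above)."""
    try:
        os.rename(task_cache_dir, cache_store_dir)
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        shutil.copytree(task_cache_dir, cache_store_dir)
        shutil.rmtree(task_cache_dir)
```

With caches on the same drive as task folders, the fallback branch is never taken and persisting a large hg-shared cache costs a single directory rename.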
I had no idea there was this much going on behind the scenes in the worker. There are a lot of major foot guns there waiting to injure people :/

Copying the Mercurial cache would be *very* expensive and would drastically decrease worker efficiency.

I think things would be much better if we had a single drive (that wasn't formatted between tasks). This model would be simpler and easier to understand and reason about. Not to mention it would be cheaper (since we wouldn't need to mount multiple EBS volumes). It would also be compatible with the new [cm]5d instances, which have a single ephemeral drive of NVMe storage.

(In reply to Gregory Szorc [:gps] from comment #26)
> I had no idea there was this much going on behind the scenes in the worker.
> There's a lot of major foot guns there waiting to injure people :/
> 
> Copying the Mercurial cache would be *very* expensive and would drastically
> decrease worker efficiency.
> 
> I think things would be much better if we had a single drive (that wasn't
> formatted between tasks). This model would be simpler and easier to
> understand and reason about. Not to mention it would be cheaper (since we
> wouldn't need to mount multiple EBS volumes). It would also be compatible
> with the new [cm]5d instances, which have a single ephemeral drive of NVMe
> storage.

I would also like to see us move to fewer drives. My preference would be just a C: drive, but removing one of the two ephemeral drives would already reduce some of the complexity.

I feel the decision about which drives we use should be data-driven. If we can show that having one ephemeral drive improves performance over just running everything on C:, then it is probably worth having it, even if there is some added complexity. But if the numbers come in roughly the same, I think it would be better to just have a single drive. This is certainly simpler and may also more realistically reflect a typical user setup. If/when we deploy to other cloud providers, we will have less work to do if the mechanics of mounting ephemeral storage differ between them. OCC contains some code to mount drives on workers when they boot up. Not needing that could remove some complexity / maintenance cost / points of failure etc.

Can you create a RelOps bug for that with the details? Probably the RelOps OpenCloudConfig component is the best one to file it against.
Flags: needinfo?(gps)
Filed bug 1479889.
Flags: needinfo?(gps)