Closed Bug 1316329 Opened 8 years ago Closed 8 years ago

Fix sccache on taskcluster generic-worker builds so it can get cache hits

Categories

(Firefox Build System :: General, defect)

defect
Not set
normal

Tracking

(firefox53 fixed)

RESOLVED FIXED
mozilla53
Tracking Status
firefox53 --- fixed

People

(Reporter: ted, Assigned: ted)

References

Details

Attachments

(2 files)

In bug 1187257 I did some testing and was able to get sccache2 working on the Taskcluster Windows builds in generic-worker. Unfortunately they have zero cache hits because the working directory (under which the srcdir and objdir are placed) contains the Task ID, and sccache uses the full compiler commandline as an input to the hash to generate the cache key. We have plans to fix that (bug 1280640), but it's a fair bit of work.

pmoore suggested that maybe we could make the build tasks create a symlink from a fixed location (like Z:\build) to the current task directory, and then run the build from the symlink directory instead. That seems fairly non-invasive. I'm a little worried that something in the build system might break because of symlinks, but it's worth a try to get a quick build time win from sccache.
Yes, let's do this with some extra commands in the task payload.

We might need to adapt the worker type too to provide a fixed location that is writable by all users (this would require a PR in https://github.com/mozilla-releng/OpenCloudConfig).

I'm thinking adding something to the start of the task that:

1) Creates the symlink at the chosen frozen location, to point to %CD% (the task user home directory)
2) cd into that directory
3) execute preexisting build steps...

This can be implemented as a change to the decision task, i.e. in-tree changes to the decision task creation process.

Probably easiest is to try the changes manually first with some manual task submissions in the Task Creator (https://tools.taskcluster.net/task-creator/), before building them into the decision task.

CC'ing grenade for advice about worker type changes.
We'd probably want to add the symlink commands here, in the mozharness_on_windows transform:
https://dxr.mozilla.org/mozilla-central/rev/783356f1476eafd8e4d6fa5f3919cf6167e84f8d/taskcluster/taskgraph/transforms/job/mozharness.py#205

We should probably do this the same way as proposed in bug 1269355 comment 1, by honoring the 'use-sccache' property of the task and only adding these commands if that's present.
If someone else makes the worker type changes (or points out an existing location we can use that wouldn't require changes) I will do the work to test the symlink commands.
Summary: Fix sccache on taskcluster generic-worker so it can get cache hits → Fix sccache on taskcluster generic-worker builds so it can get cache hits
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #3)
> If someone else makes the worker type changes (or points out an existing
> location we can use that wouldn't require changes) I will do the work to
> test the symlink commands.

Rob probably knows best, in terms of what might already be available.

The worker type is defined in e.g. https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/Manifest/gecko-1-b-win2012.json but I'm not sure where the code is that mounts the Z: drive and sets ACLs, as this happens during worker runtime, rather than on AMI creation.
Flags: needinfo?(rthijssen)
previous attempts to have the payload create symlinks resulted in permissions errors due to the task user not having rights to create symlinks. i believe this is now fixed with https://github.com/mozilla-releng/OpenCloudConfig/commit/7cb5df0506923c79df9edf5b3e5b521685306011 where we grant that access. the z: drive is mounted long before the payload starts being processed and any user can write to the ephemeral drives so it should just be a case of creating the links in the payload.
Flags: needinfo?(rthijssen)
Assignee: nobody → ted
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #7)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=5d427db09788

The first set of builds worked OK, so I retriggered all of the builds and they were able to get cache hits, like in:
https://public-artifacts.taskcluster.net/cwCqX111RxWqGXnb72kGIQ/0/public/logs/live_backing.log

22:13:37     INFO -  Cache hits                                                                                                                                                                           3186

The builds were all significantly faster too, as expected.
Comment on attachment 8810672 [details]
bug 1316329 - build from a symlink dir in generic-worker builds.

https://reviewboard.mozilla.org/r/92960/#review93486

::: taskcluster/taskgraph/transforms/job/mozharness.py:230
(Diff revision 1)
> +    # cache hash key, so create a symlink to the task dir and build from
> +    # the symlink dir to get consistent paths.
> +    if taskdesc.get('needs-sccache'):
> +        worker['command'].extend([
> +            r'if exist z:\build rmdir z:\build',
> +            r'mklink /d z:\build %cd%',

Each task runs as a different user, so I could imagine when we create the folder, we might need to also add an ACL to make it writable by the Everyone group, so that a future task user can delete it.

To test this, we'll need a worker that runs two jobs, and since you can't guarantee getting the same worker twice, maybe we should set up a dedicated worker type with maxCapacity of 1 so that we can make a couple of try pushes.

This also requires Z:\ to be writable by all - I hope this is the case already - if not, we can tweak the DSC that sets up the image in github.com/mozrelops/OpenCloudConfig repo.
Attachment #8810672 - Flags: review?(pmoore)
Per comment 5 it sounds like write permissions shouldn't be a problem. I don't actually know what the permission situation is like for removing a symlink--I would assume that if a user has write permission to the containing directory they would have permission to remove the symlink, but I don't actually know! If you want to set up a dedicated worker and run those tests that would be great. You should be able to just duplicate the tasks from my try push and change the worker type.
As I'm on PTO on Monday, I've made a try push today:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=da612b47c461345fc0081115193f75f67ed03050

This just changes the worker type to gecko-1-b-win2012-beta (instead of gecko-1-b-win2012) - and I believe the beta worker type is essentially on the same version as the non-beta type. I set maxCapacity to 1 for the beta worker type, so since this will create multiple builds in serial fashion, I'm hoping we'll see if the tasks have permission to delete the symlink created by the previous task...

As I'll be out on Monday, speak to grenade if you have questions about the *-beta worker types, as he kindly set them up.
Hey Ted, in my try push with your change, I didn't see your new code kicking in:

https://public-artifacts.taskcluster.net/e0mo6vEcQYCrQ9fj2HFvOQ/0/public/task-graph.json

Did I do something wrong in my push in comment 12?

Thanks!
Flags: needinfo?(ted)
Ah yes! You need the patch from bug 1269355 as well, that's what sets the 'needs-sccache' flag on the task definition. (I realize I didn't actually post that patch, I've only pushed it to try.)
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #16)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=8095dcbe9a5a

pmoore's suspicions were correct, and we did need to add a command to set an ACL on the symlink so that other task users would be able to remove it. With that added this try push seems to be progressing reasonably well.
There is a unit test failure on those jobs that I hadn't paid attention to until pmoore asked about it--it's a legitimate failure and it's pointing out something that would be broken, so I'm glad I wrote that test! We call the Win32 API function `GetFinalPathNameByHandleW` to normalize the case of source file names, but in this case that will give us back the path with the symlink resolved. The script that dumps symbols for crash reporting uses simple prefix matching to check if source files fall within the source repository, and the source repository path is under the symlink, so that fails.

It shouldn't be a terribly hard fix--we just need to resolve the srcdir path the same way.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #21)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=ca5405acc96a

This push has a patch to fix that issue and the builds are green.
Comment on attachment 8813475 [details]
bug 1316329 - follow symlinks for srcdir path in symbol dumping on windows.

https://reviewboard.mozilla.org/r/94888/#review95094
Attachment #8813475 - Flags: review?(gps) → review+
Comment on attachment 8810672 [details]
bug 1316329 - build from a symlink dir in generic-worker builds.

https://reviewboard.mozilla.org/r/92960/#review95096
Attachment #8810672 - Flags: review+
Comment on attachment 8810672 [details]
bug 1316329 - build from a symlink dir in generic-worker builds.

https://reviewboard.mozilla.org/r/92960/#review95188

Looks good, thanks Ted!
Attachment #8810672 - Flags: review?(pmoore) → review+
https://hg.mozilla.org/mozilla-central/rev/e5a62d4ce95a
https://hg.mozilla.org/mozilla-central/rev/f4c1cb96ec82
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: