Add some sort of caching for gecko checkouts on windows generic-worker jobs

RESOLVED FIXED in Firefox 67

Status: RESOLVED FIXED

People

(Reporter: kats, Assigned: ahal)

Tracking

(Depends on 2 bugs, Blocks 1 bug, {leave-open})

Version: unspecified
Target Milestone: mozilla67

Firefox Tracking Flags

(firefox67 fixed)

Attachments

(5 attachments)

We're running some webrender CI stuff on windows via generic-worker (this task) and a good chunk of time is spent checking out and updating the mercurial repo.

I recall seeing a comment by :ahal somewhere that this was an item on a to-do list, but I'm not sure if there's a bug already on file for optimizing that so I'm filing one.

See:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/common.py#89

It's possible that comment is out of date; I see a "WriteableDirectoryCache" in:
https://docs.taskcluster.net/docs/reference/workers/generic-worker/docs/payload

Gps was the one who set this all up for docker-worker; unfortunately he's not around anymore, so I'm a bit out of my element.

Brian, is the comment in the first link still accurate? If so, is there a bug we can depend on that tracks implementing it? If not, do you have any pointers that can help us set this up?

Flags: needinfo?(bstack)

Good question. I think pmoore is best able to answer this sort of thing.

Flags: needinfo?(bstack) → needinfo?(pmoore)

Yes, generic-worker also has caches. The WriteableDirectoryCache link from ahal is the correct one, and it is essentially equivalent to the docker-worker cache directive. You simply give the cache a name, for which you require the matching scope, and any content in that cache directory will be preserved at the end of the task run. If a new task that declares the same cache name lands on that worker, the content will be mounted from the previous task run.

The link should have all the info you need, but let me know if you are missing anything.
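To make the cache mechanism described above concrete, here is a minimal sketch of a generic-worker task payload declaring a writable directory cache, expressed as a Python dict. The cache name and directory are illustrative, not the names this bug ended up using.

```python
# Hypothetical sketch of a generic-worker payload with a writable
# directory cache, per the WriteableDirectoryCache docs linked above.
checkout_cache = "gecko-level-3-checkouts"  # illustrative name

payload = {
    "mounts": [
        {
            # Content under this directory is preserved when the task
            # ends, and remounted for later tasks naming the same cache.
            "cacheName": checkout_cache,
            "directory": "build",  # relative to the task directory
        }
    ],
}

# The task must also carry a scope granting use of the named cache.
scopes = ["generic-worker:cache:" + checkout_cache]
```

The worker matches incoming tasks to existing cache directories purely by `cacheName`, which is why the scope requirement exists: without it, any task could mount (and poison) another level's checkout cache.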

Flags: needinfo?(pmoore)

This also adds a cache for the gecko checkout if present.

Depends on D17689

Thanks Pete. Here's an initial stab:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1e866038733799f6f6be0cad05c215f55e3b50f3

Here is an example task definition with the cache:
https://tools.taskcluster.net/groups/WeV1wa7ZTBWL1BL1XoceEQ/tasks/LrB2cpcFQyKajBQahdnGNg/details

All those try jobs re-clone mozilla-central, but I guess that's expected because we'd need to wait for a host that already has the cache to run a second task, and the pool is probably much larger than my handful of retriggers, right? Is there a better way for me to test that this is working? Or should I just land and monitor it after the fact?

Flags: needinfo?(pmoore)

(In reply to Andrew Halberstadt [:ahal] from comment #6)

> Thanks Pete. Here's an initial stab:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=1e866038733799f6f6be0cad05c215f55e3b50f3

Looks good!

> Here is an example task definition with the cache:
> https://tools.taskcluster.net/groups/WeV1wa7ZTBWL1BL1XoceEQ/tasks/LrB2cpcFQyKajBQahdnGNg/details

This looks correct. See e.g. these log lines.

> All those try jobs re-clone mozilla-central, but I guess that's expected because we'd need to wait for a host that already has the cache to run a second task, and the pool is probably much larger than my handful of retriggers, right? Is there a better way for me to test that this is working? Or should I just land and monitor it after the fact?

You could make a try push changing worker type gecko-t-win10-64 to gecko-t-win10-64-beta (which is the staging pool for that worker type). The only reason this might help is that it is a much more constrained pool, and you could explicitly set it to 2 or 3 workers, for example. I'd be tempted to just land it though, as the jobs are green and the logs look like the cache is being used correctly.

FWIW you can use a patch similar to this one to "swap out" the worker type:
https://hg.mozilla.org/try/rev/28d524a23fd07c16dac5537d84313a70ec4fa830

Note, you'll probably want to shrink the find_replace_dict array to just have the worker type(s) you are enabling caches on.

Also please check with :grenade if he is using the staging worker types for anything at the same time, just so you don't collide. He or I can also assist with setting the max capacity (worker pool size) for the staging worker type(s) you want to test with to something small (e.g. 1-3 workers).

Good luck!

Flags: needinfo?(pmoore)

Ok, I'm more than happy to just land, especially knowing it looks to be working. I'll keep an eye on it over the coming days and if it looks like something is wrong I'll use the worker-type trick then to test it out.

Thanks!

Assignee: nobody → ahal
Status: NEW → ASSIGNED
Attachment #9039217 - Attachment description: Bug 1519472 - [taskgraph] Factor logic for adding a cache in job.common to a new function → Bug 1519472 - [taskgraph] Factor logic for adding a cache in job.common to a new function, r?dustin
Attachment #9039218 - Attachment description: Bug 1519472 - [taskgraph] Support generic-worker caches in run_task → Bug 1519472 - [taskgraph] Support generic-worker caches in run_task, r?dustin

Nuts, looks like this causes build failures due to insufficient disk space:
https://tools.taskcluster.net/groups/VHZWELzjSqGJFmm3l51bXg/tasks/WbD7Hwv_TOmBhzmy7txGCQ/runs/0/logs/public%2Flogs%2Flive.log

Not sure if it's intermittent or not. I didn't see this with artifact builds. I guess one easy way to fix this will be to implement a "use_caches" key in the job schema and set it to false for build tasks.
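The "use_caches" opt-out suggested above could look roughly like the following taskgraph-transform-style sketch. The function name, key names, and cache naming scheme are assumptions for illustration, not the actual taskgraph code.

```python
# Hypothetical sketch of a "use-caches" opt-out in the job schema.
def configure_checkout_cache(job, taskdesc):
    """Attach a gecko-checkout cache unless the job opts out."""
    if not job.get("run", {}).get("use-caches", True):
        # e.g. Windows build tasks without the disk space for a cache
        return
    caches = taskdesc.setdefault("worker", {}).setdefault("caches", [])
    caches.append({
        "type": "persistent",
        # Cache names are split by trust level (1 for try, 3 for m-c)
        # so try pushes can never poison a production checkout cache.
        "name": "level-{}-checkouts".format(job.get("level", 1)),
        "mount-point": "checkouts",
    })
```

A job kind would then set `use-caches: false` in its run config to keep the existing clone-per-task behavior.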

Attachment #9039217 - Attachment description: Bug 1519472 - [taskgraph] Factor logic for adding a cache in job.common to a new function, r?dustin → Bug 1519472 - [taskgraph] Factor logic for adding a cache in job.common to a new function, r?tomprince

The hosts don't have enough disk space to cache mozilla-central.

Depends on D17689

Attachment #9039218 - Attachment description: Bug 1519472 - [taskgraph] Support generic-worker caches in run_task, r?dustin → Bug 1519472 - [taskgraph] Support generic-worker caches in run_task, r?tomprince
Blocks: 1526028
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0b8097689bb5
[taskgraph] Factor logic for adding a cache in job.common to a new function, r=tomprince
https://hg.mozilla.org/integration/autoland/rev/b6e19a5b0ab9
[ci] Opt out of caching for generic-worker based Windows builds, r=tomprince
https://hg.mozilla.org/integration/autoland/rev/2ceeee1915ae
[taskgraph] Support generic-worker caches in run_task, r=tomprince
Status: ASSIGNED → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67

I've just seen comment 9, and realise this is probably a bug in the generic-worker implementation, which I think we'll need to fix before we can roll this out (I think we'll have to back this out).

We have a mechanism that checks we have enough disk space on the partition where task directories are stored, before each task runs. If not, we repeatedly delete caches until we have freed up enough space.

However, in the case that the caches are on a different partition to the task directories, this doesn't prevent the cache partition from filling up.

So we should check the partition with caches, and the partition with task directories, both have enough space at the start of every task.

I'll create a bug for this in the generic worker bugzilla component and make it a blocker for this bug.
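The two-partition check described above amounts to something like this sketch (the real generic-worker is written in Go; threshold and helper names here are illustrative):

```python
import shutil

def ensure_free_space(partitions, min_free_bytes, purge_one_cache):
    """Before each task, verify every relevant partition (task dirs
    AND caches) has enough free space, purging caches until it does.

    purge_one_cache() deletes the least-recently-used cache and
    returns False once there is nothing left to purge.
    """
    for path in partitions:
        while shutil.disk_usage(path).free < min_free_bytes:
            if not purge_one_cache():
                raise RuntimeError("not enough space on " + path)
```

Checking only the task-directory partition, as in the bug described above, lets a cache partition like Y:\ fill up unnoticed until a mid-task mkdir fails.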

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&searchStr=windows%2C10%2Cx64%2Cquantumrender%2Crelease%2Cwebrender%2Cstandalone%2Cwebrender-windows%2Cwr%28wrench%29&fromchange=da71b4d4ad402c64c19f686ed6014ec559c1844c&tochange=4bc31addf415cc076ac42ab9ed64002160f57f86&selectedJob=227129866

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=227129866&repo=mozilla-central&lineNumber=2190

Backout link: https://hg.mozilla.org/mozilla-central/rev/4bc31addf415cc076ac42ab9ed64002160f57f86

[taskcluster 2019-02-08T10:32:16.918Z] Exit Code: 0
[taskcluster 2019-02-08T10:32:16.918Z] User Time: 0s
[taskcluster 2019-02-08T10:32:16.918Z] Kernel Time: 0s
[taskcluster 2019-02-08T10:32:16.918Z] Wall Time: 39m17.6778452s
[taskcluster 2019-02-08T10:32:16.918Z] Result: SUCCEEDED
[taskcluster 2019-02-08T10:32:16.918Z] === Task Finished ===
[taskcluster 2019-02-08T10:32:16.918Z] Task Duration: 39m17.6788022s
[taskcluster 2019-02-08T10:32:16.920Z] [mounts] Preserving cache: Moving "Z:\task_1549618301\build" to "Y:\caches\QPnek7u7Qe2ICSvyeudjPw"
[taskcluster 2019-02-08T10:39:34.180Z] [mounts] Removing cache level-3-checkouts from cache table
[taskcluster 2019-02-08T10:39:34.180Z] [mounts] Deleting cache level-3-checkouts file(s) at Y:\caches\QPnek7u7Qe2ICSvyeudjPw
[taskcluster:error] [mounts] Could not unmount <nil> due to: 'Could not persist cache "level-3-checkouts" due to mkdir Y:\caches\QPnek7u7Qe2ICSvyeudjPw\src\vs2017_15.8.4\SDK\bin\10.0.17134.0\x64\en-US: There is not enough space on the disk.'
[taskcluster 2019-02-08T10:39:59.316Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/I9UVQSBISBW3F2h2znaPcA/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2020-02-08T09:51:01.596Z
[taskcluster:error] Could not persist cache "level-3-checkouts" due to mkdir Y:\caches\QPnek7u7Qe2ICSvyeudjPw\src\vs2017_15.8.4\SDK\bin\10.0.17134.0\x64\en-US: There is not enough space on the disk.

Flags: needinfo?(ahal)
Target Milestone: mozilla67 → ---

No need to block on the fix, I can disable caches for those wrench tasks. That way we'll still get the benefit of caches in tasks that don't have this configuration in the meantime. We could also see if moving those srcdir checkout caches to the same partition on Windows is feasible. I'm not really sure why they're placed where they are.

But either way the bug in generic-worker should be fixed, just saying that it doesn't need to block this one.

Flags: needinfo?(ahal)
See Also: → 1526311

If you're disabling the caches for wrench and relanding please also disable it for the searchfox build job which failed similarly: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=227131320&repo=mozilla-central&lineNumber=251291

Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/036604abf1e5
[taskgraph] Factor logic for adding a cache in job.common to a new function, r=tomprince
https://hg.mozilla.org/integration/autoland/rev/2053a035eee6
[ci] Opt out of caching for generic-worker based Windows builds, r=tomprince
https://hg.mozilla.org/integration/autoland/rev/887cc76ba189
[taskgraph] Support generic-worker caches in run_task, r=tomprince
Status: REOPENED → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67
Blocks: 1527313
See Also: → 1527313
No longer blocks: 1527313

Disables caching on generic-worker based Windows builds for Thunderbird
due to insufficient disk space on the build hosts.

Attachment #9043330 - Attachment description: Port Bug 1519472 - Disable caching on TB Windows builds. r=jorgk → Port Bug 1519472 - Disable caching on TB Windows builds. r?jorgk

Hmm, the patch in Phabricator is NOT what you've been running on try :-( - I'll land the latter.

Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/d2574be5d927
Port Bug 1519472 - Disable caching on Windows builds. r=jorgk

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #16)

> If you're disabling the caches for wrench and relanding please also disable it for the searchfox build job which failed similarly: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=227131320&repo=mozilla-central&lineNumber=251291

There are similar problems on cbindgen jobs: https://queue.taskcluster.net/v1/task/G0j-vl9PS861uEMQ7u-6Dg/runs/0/artifacts/public/logs/live_backing.log

Actually a different problem: it's not disk space, but "there's a process still running that uses a file in the checkout".

Depends on: 1527798
Depends on: 1527799

Ironically, this makes Windows tasks using a cache slow to finish. See https://taskcluster-artifacts.net/FulbQos7T0ewTbXv746i-Q/0/public/logs/live_backing.log

[taskcluster 2019-02-15T05:24:03.500Z] [mounts] Preserving cache: Moving "Z:\task_1550205030\build" to "Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA"
[taskcluster 2019-02-15T05:28:32.966Z] [mounts] Denying task_1550205030 access to 'Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA'
[taskcluster 2019-02-15T05:29:27.546Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/FulbQos7T0ewTbXv746i-Q/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-03-15T05:06:20.013Z

That's 5 minutes before tasks that might depend on this one can start (well, in this case the task failed, so nothing was going to start anyway).

I'm going to mark this bug as blocked by bug 1528198 and bug 1526311.

Once both of these bugs are resolved, we should have a much better time using generic-worker caches on Windows.

My feeling is that it isn't wise for us to use this worker feature while these issues are still open.

Status: RESOLVED → REOPENED
Depends on: 1528198, 1526311
Resolution: FIXED → ---

(In reply to Mike Hommey [:glandium] from comment #25)

> Ironically, this makes Windows tasks using a cache slow to finish. See https://taskcluster-artifacts.net/FulbQos7T0ewTbXv746i-Q/0/public/logs/live_backing.log
>
> [taskcluster 2019-02-15T05:24:03.500Z] [mounts] Preserving cache: Moving "Z:\task_1550205030\build" to "Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA"
> [taskcluster 2019-02-15T05:28:32.966Z] [mounts] Denying task_1550205030 access to 'Y:\caches\H6KTLHo4Sh-Vy0PMSQgAjA'
> [taskcluster 2019-02-15T05:29:27.546Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/FulbQos7T0ewTbXv746i-Q/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-03-15T05:06:20.013Z
>
> That's 5 minutes before tasks that might depend on this one can start (well, in this case the task failed, so nothing was going to start anyway).

The slowness here will be addressed by bug 1528198.

(In reply to Pete Moore [:pmoore][:pete] from comment #26)

> I'm going to mark this bug as blocked by bug 1528198 and bug 1526311.
>
> Once both of these bugs are resolved, we should have a much better time using generic-worker caches on Windows.
>
> My feeling is that it isn't wise for us to use this worker feature while these issues are still open.

Ah, I see I should have done this in bug 1527313 - closing this one again. Sorry for the noise.

Status: REOPENED → RESOLVED
Closed: 6 months ago
No longer depends on: 1526311, 1528198
Resolution: --- → FIXED
Depends on: 1528422
Depends on: 1528891
Keywords: leave-open

The caches appear to be causing these tasks to take several hours to complete.

Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/mozilla-central/rev/3b08a133c893
Disable caches on windows repackage builds; r=aki a=tomprince
Duplicate of this bug: 1350956