Closed Bug 1527313 Opened 2 years ago Closed 5 months ago

Enable sourcedir caches for Windows generic-worker builds

Categories

(Firefox Build System :: Task Configuration, task, P1)

task

Tracking

(firefox75 fixed)

RESOLVED FIXED
mozilla75
Tracking Status
firefox75 --- fixed

People

(Reporter: ahal, Assigned: tomprince)

References

Details

(Whiteboard: [ci-costs-2020:done])

Attachments

(24 files)

47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review

In bug 1519472 I turned on caching for generic-worker tasks. However there's a bug in generic-worker that was causing these builds to fail. The error happens because the cache is set to a different volume than the workspace.

We can either wait for the fix in bug 1526311 or try to move the cache from Y: to Z:. In bug 1526311#c1 it looks like the fact the checkout happens on Y: might have been a hack to work around an issue anyway, and keeping it on Z: is more optimal. Might be worth investigating whether switching back to Z: causes any problems.

Depends on: 1526311
No longer depends on: 1519472
Depends on: 1528198

(In reply to Andrew Halberstadt [:ahal] from comment #0)

In bug 1519472 I turned on caching for generic-worker tasks. However there's a bug in generic-worker that was causing these builds to fail. The error happens because the cache is set to a different volume than the workspace.

We can either wait for the fix in bug 1526311 or try to move the cache from Y: to Z:. In bug 1526311 comment 1 it looks like the fact the checkout happens on Y: might have been a hack to work around an issue anyway, and keeping it on Z: is more optimal. Might be worth investigating whether switching back to Z: causes any problems.

I've added bug 1528198 as a dependency, which is the bug to move caches from Y: to Z:. That bug is blocked on bug 1433854, which is another generic-worker fix.

(In reply to Pete Moore [:pmoore][:pete] from comment #1)

I've added bug 1528198 as a dependency, which is the bug to move caches from Y: to Z:. That bug is blocked on bug 1433854, which is another generic-worker fix.

It turns out bug 1528198 does not depend on bug 1433854 after all, so we can probably go ahead with bug 1528198, which also removes the dependency on bug 1526311, since that only exists when the caches are on a different drive.

So I think we only need to implement bug 1528198 to unblock this bug. \o/

No longer depends on: 1526311

Please don't enable caches without doing something about bug 1528891. The current situation is a mess.

  • Many (most?) build scripts assume some paths, ignoring whatever taskcluster sets in the environment
  • Windows builds are set up differently from docker-worker builds
  • Both Windows builds and docker-worker builds setups wrt caches are broken, in different ways
  • Docker worker builds set a checkout cache and a workspace cache. The former is shared across all job types. The latter is not. Most builds checkout in the workspace cache directory, which is actually a good thing because they also build in there, which means the checkout is polluted with build artifacts. Lots of them.
  • Windows builds, with caches enabled, set a checkout cache and nothing else, shared across all job types. Builds happen in that directory. Which means build artifacts are being cached along the checkouts.
  • Interestingly, we do checkout in a way that cleans up build artifacts, so they don't actually after subsequent builds, but that means we may spend an enormous amount of time cleaning up a large amount of data that we cached... for nothing. (and with the slow I/O on Windows, this is noticeable)
  • While bug 1519472 disabled windows caches on some task kinds, it didn't do so on all windows builds... most notably, use-caches is still true for spidermonkey jobs.

It turns out the setup on docker workers is actually fine, because robustcheckout does put the hg-share in the checkout cache. But per bug 1528891, Windows workers don't do that.

I think the way to address this move the objdir out of the source directory, so that we can cache the source directory, and the object directory doesn't get cached. This might improve the behavior slightly on linux, as well.

Priority: -- → P3

(In reply to Tom Prince [:tomprince] from comment #5)

I think the way to address this move the objdir out of the source directory, so that we can cache the source directory, and the object directory doesn't get cached. This might improve the behavior slightly on linux, as well.

Tom, is this something we can resurface as a higher priority? If you're super swamped I completely understand, but maybe consider it for the near future. It would be great if we could reduce our rather long Windows build times a bit, it's not clear how big of an impact this would have but it looks like maybe 5-10 minutes.

Flags: needinfo?(mozilla)

Jordan, could you please add this to your roadmap for H1? 5-10 minutes of Windows build would save a lot of money.
thanks

Flags: needinfo?(jlund)
Priority: P3 → P1

Sorry I was on PTO on Jan 2nd and 3rd.

Tom and I will discuss this this week. Leaving needinfo open and reply with an update tomorrow (Jan 7th).

Tom met with some build and dev workflow folks to try and get a better idea of the options we have, what they can do, and what work would be involved on our side.

Tom, could you update status here so we are all on the same page.

We can allocate time for releng bits in H1.

Flags: needinfo?(jlund)

I chatted with the build team. objdir outside the checkout should be supported (and there are people that use this configuraiton locally.

It should be possible to adjust the taskcluster+mozharness config to point to a new objdir and the build should work, though it would be good to verify via diffoscope that that is in fact the case.

Flags: needinfo?(mozilla)

Note that some commands do always create an objdir under the source directory for virtualenvs. Those would need to be changed to be able to create them outside the source directory.

As a data point to validate the potential benefit here: it looks like the median number of tasks performed by the windows build workers is 5 tasks per instance.

the median number of tasks performed by the windows build workers is 5 tasks per instance.

That doesn't seem like a lot.

(In reply to Mike Hommey [:glandium] from comment #11)

Note that some commands do always create an objdir under the source directory for virtualenvs. Those would need to be changed to be able to create them outside the source directory.

Afaik, they'll create the objdir wherever the objdir is supposed to exist, as determined by MozbuildObject.from_environment().topobjdir. So we shouldn't need to worry about this as long as the build system is configured to use .

The builds have MOZ_OBJDIR set to obj-firefox, and that doesn't prevent obj-x86_64-pc-linux-gnu from being created on builders (I've seen it on loaners).

Interesting, here's where the virtualenv path is defined:
https://searchfox.org/mozilla-central/rev/f98dad153b59a985efd4505912588d4651033395/python/mozbuild/mozbuild/base.py#269

I see configure instantiates a few of these classes, but appears to use the same path as well. So I guess for whatever reason, that MOZ_OBJDIR isn't getting picked up by the MozbuildObject class.

Keywords: leave-open
Blocks: 1614463

The alternative code path was unused (as demonstrated by the presence of
manifestdestiny package). Remove that code path, so we can fail with a better
error message, if we don't have the right path to the requirements file.

Assignee: nobody → mozilla
Status: NEW → ASSIGNED

The alternative code path was unused (as demonstrated by the presence of
manifestdestiny package). Remove that code path, so we can fail with a better
error message, if we don't have the right path to the requirements file.

The build tasks already use that spelling, so make the naming consistent.

The original code hard-coded the path of the source directory. Instead, use the
actual source directory.

Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/autoland/rev/5db1dc4cff3d
[mozharness] Fail if we can't find the mozbase requirements file; r=ahal
https://hg.mozilla.org/integration/autoland/rev/40216892b9f4
[mozharness] Fail if we can't find the marionette requirements file; r=ahal

In automation, GECKO_PATH always refers to the source directory, so use that in mozharness
rather than assuming it is somewhere relative to work_dir.

Attachment #9125653 - Attachment description: Bug 1527313: Don't hardcode `MOZ_OBJDIR` in mozharness configs; r?Callek → Bug 1527313: [mozharness] Don't hardcode `MOZ_OBJDIR` in mozharness configs; r?Callek
Attachment #9125687 - Attachment description: Bug 1527313: [mozhanress] Find multi-l10n source files relative to source directory; r?Callek → Bug 1527313: [mozharness] Find multi-l10n source files relative to source directory; r?Callek
Attachment #9125746 - Attachment description: Bug 1527313: [mozhanress] Use `GECKO_PATH` consitently to find the source directory; r?Callek → Bug 1527313: [mozharness] Use `GECKO_PATH` consitently to find the source directory; r?Callek

We use the shell to expand this, rather than substituting the value here,
because GECKO_PATH will be set in the run_task transform (after the rest of this
stack lands).

We set it here, rather than depending on taskcluster script to set it, so that
we can use it to construct the objdir we will use.

This moves the object directory and source directories around in all mozharness
jobs, to allow enabling caching on windows builders.

This makes a number of changes that all need to land at once:

  • Move the source checkout for the workspace cache mount, to the checkouts
    cache mount.
  • Makes the object directoy from underneath the source directory, to directly
    in the work directory (which is still under workspace).
  • Sets the object directory to obj-build instead of obj-firefox.
  • Stops caching the workspace directory (it is still a volume in docker workers,
    so writes perform well; a followup revision add some checks around this).
  • Removes one level of directory in the mozharness workdir (things were under
    workspace/build, but are now just under workspace/.
  • Adjust paths in environment variables and artifact specifications to match
    the above changes.

Since mozharness tasks are no longer caching the workspace directory, we don't
need a key for different tasks.

Since the workspace is no longer cached, but needs to be a volume for
performance reasons, add a check to ensure that is the case.

These jobs only use the workspace for obj directories, have some some logic to
cleanup the directory at the beginning of the run, so there is no reason to
cache the directory.

This also removes the now-unused common code for creating a workspace cache directory.

(In reply to Mike Hommey [:glandium] from comment #15)

The builds have MOZ_OBJDIR set to obj-firefox, and that doesn't prevent obj-x86_64-pc-linux-gnu from being created on builders (I've seen it on loaners).

I think this is because linux workers run mozharness with mach, so at that point mach is being run without a .mozconfig, and does not appear to honor the MOZ_OBJDIR in the environment at that point, in that state.

Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/autoland/rev/7989682c179c
[taskgraph] Use `artifact-reference` in upload-sources task; r=Callek
https://hg.mozilla.org/integration/autoland/rev/5625ed988192
Consolidate multi-locale config, since it does not vary by branch; r=Callek
https://hg.mozilla.org/integration/autoland/rev/e72c5f59ff11
[mozharness] Use `abs_src_dir` instead of `abs_mozilla_dir` in l10n; r=Callek
https://hg.mozilla.org/integration/autoland/rev/073a74ec38c4
[mozharness] Find multi-l10n source files relative to source directory; r=Callek
https://hg.mozilla.org/integration/autoland/rev/07bfd4b77f6c
[mozharness] Rename `abs_objdir` to `abs_obj_dir`; r=Callek
https://hg.mozilla.org/integration/autoland/rev/2e2bf618e78c
[mozharness] Use `GECKO_PATH` consitently to find the source directory; r=Callek
https://hg.mozilla.org/integration/autoland/rev/9ce1849d70cc
[mozharness] Unconditionally set `abs_obj_dir` in l10n repcks; r=Callek
https://hg.mozilla.org/integration/autoland/rev/7dd467519723
[mozharness] Don't hardcode `MOZ_OBJDIR` in mozharness configs; r=Callek
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/autoland/rev/c290aa0e39be
Reduce extraneous differences in taskcluster mozharness scripts; r=Callek
https://hg.mozilla.org/integration/autoland/rev/328669c39fa8
[mozharness] Checkout l10n-central next to mozilla-central; r=firefox-build-system-reviewers,mshal
https://hg.mozilla.org/integration/autoland/rev/d8b482f1e496
[taskgraph] Use `GECKO_PATH` to find the taskcluster mozharness script; r=Callek
https://hg.mozilla.org/integration/autoland/rev/182892303a6e
[taskgraph] Set `$WORKSPACE` for mozharness tasks in taskgraph; r=Callek
Whiteboard: [ci-costs-2020:done]

This is instead of extracting them in the gecko source directory, where they
will get deleted by the next task anyway.

Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/autoland/rev/9339904adde3
Pass workdir down to multi-l10n script; r=Callek
https://hg.mozilla.org/integration/autoland/rev/f2a1721f5d43
Use `GECKO_PATH` based directories in more places in android pgo; r=nalexander
https://hg.mozilla.org/integration/autoland/rev/5ee475239171
Set `MOZ_OBJDIR` explicitly in release-source tasks; r=Callek
https://hg.mozilla.org/integration/autoland/rev/278f9917616e
Adjust openh264 build to not assume the the source checkout is in work-dir; r=Callek
https://hg.mozilla.org/integration/autoland/rev/42db12ccc1be
Move objdir out of source directory for all mozharness builds; r=glandium,Callek
https://hg.mozilla.org/integration/autoland/rev/66b544de8846
[taskgraph] Remove workspace-key from mozharness tasks; r=glandium
https://hg.mozilla.org/integration/autoland/rev/424f2ad03fe8
[taskgraph] Ensure that the mozharness workspace is part of a docker volume; r=glandium
https://hg.mozilla.org/integration/autoland/rev/f4c911c975a3
Adjust openh264 build to extract tools in uncached workspace; r=Callek
https://hg.mozilla.org/integration/autoland/rev/53bbc6da5941
Enable caching on windows builds; r=glandium
https://hg.mozilla.org/integration/autoland/rev/fbe3fcbb31d8
[taskgraph] Don't use a workspace cache for hazard builds; r=glandium
Keywords: leave-open
Regressions: 1620538
Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/dfaa6a437e00
Fix Windows cross-compilation bustage. patch by glandium on Matrix
Regressions: 1621342
Duplicate of this bug: 1528891
Regressions: 1624910
You need to log in before you can comment on or make changes to this bug.