Closed Bug 1390700 Opened 7 years ago Closed 7 years ago

Use sparse checkouts for decision task

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla57

People

(Reporter: gps, Assigned: gps)

References

(Blocks 2 open bugs)

Details

Attachments

(5 files)

Mercurial 4.3 has experimental support for sparse checkouts. These are working directories that contain a subset of all the files in a revision. Use of sparse checkouts means `hg update` operations only need to touch a subset of files, which means they should be much faster. Widespread use of sparse checkouts has the potential to drastically decrease time spent in VCS operations. It can take minutes to create >100k files in the working directory. And for most operations, only a subset of files in the working directory are relevant, so creating extra files is just wasteful. Sparse checkouts should unblock us from using VCS checkouts for running tests. We could never run WPT from checkouts (bug 1286900) because of perf issues creating the checkout. Sparse checkouts will enable us to only manifest the files we need, which should make using version control from tests feasible. In this bug, I want to get the decision task using sparse checkouts. Why the decision task? It should be pretty well isolated. It uses Linux (which is easy for in-tree hacking). And it's in the critical path for running things, so any speed improvement has an immediate benefit. Off the top of my head, this will involve: * New Docker image with Mercurial 4.3 (decision task is special and image needs manual upload) * In-tree definition of sparse profiles for Mercurial (not sure where to put these yet) * Robustcheckout awareness for sparse checkout, how to widen/narrow checkout as appropriate * run-task awareness for which sparse profile to use * New TC caches for sparse-aware repos (repos with sparse enabled have a new requirement that will lock out older Mercurial clients) * Mechanism in taskgraph YAML for specifying sparse profile to use
What would that profile look like? The decision task is very close to using moz.build, which I think would need the entire tree..
The Mercurial sparse profiles essentially define patterns of files to include or exclude. It can support things like **/moz.build if we need all moz.build files, for example. With Mercurial 4.3, run `hg help -e sparse` for more info. And stating it for the record, sparse checkouts are only a medium term solution for scaling. Infinite scalability means working directories backed by a virtual filesystem that realizes file content on first access. That way, you can access any file at any time without having to worry about tracking it in a profile. However, that world effectively reverts distributed version control to something more centralized and/or requires virtual filesystem support, which we can't easily get in TaskCluster now. Plus Mercurial's virtual filesystem working directory support is still manifesting. There's also a non-complete clone involved in there somewhere.
I have a working proof-of-concept at https://treeherder.mozilla.org/#/jobs?repo=try&revision=952e0746f3e2e70be0dc0202cdda561711b0a071. Example sparse profile is https://hg.mozilla.org/try/file/401b0507166257b9ee8f18603144a6d58c1c3b76/build/sparse-profiles/taskgraph. That manifested 3553 files in the working directory instead of 234137, saving a few dozen seconds on fresh checkout. The patches are super hacky. Don't judge me on their quality :) There are a few "random" files being opened by taskgraph that I was somewhat surprised about. Also, with sparse checkouts, we'll have to be extra cautious of anything doing a filesystem walk or VCS query to discover files. e.g. if we assemble a manifest of **/moz.build files and sparse checkouts are in play, the manifest content varies depending on the sparse checkout configuration. This could come into play for things like computing the Docker image build context archive hashes (which taskgraph does).
Depends on: 1391394
Depends on: 1392886
Assignee: nobody → gps
Status: NEW → ASSIGNED
I'm almost ready to submit reviews for this. I punted most of the prep work to bug 1392886. https://treeherder.mozilla.org/#/jobs?repo=try&revision=5c01f4e1fae19e9e01f350171fad89f292cc185b has some tasks on it. Experimental sparse profiles for flake8, wptlint, and sphinx docs. https://public-artifacts.taskcluster.net/Bv8T1qveSvadVYP1eWXe_w/0/public/logs/live_backing.log shows us going from a wptlint checkout to a flake8 checkout. It takes 15s to perform ~70k file deletes+adds. (Anything involving WPT is about the worst case because of its absurd file count.) https://public-artifacts.taskcluster.net/aLvwHYqAQ0aQ6GONLknG0A/0/public/logs/live_backing.log is more interesting. That appears to be a flake8 checkout transitioning to a docs checkout. That takes ~4s to perform ~1600 file modifications. The real win from these tasks is not doing a full checkout. We're seeing the initial checkout after clone complete in 2-3s instead of upwards of a minute. Since TC's scheduling doesn't have cache affinity, checkouts tend to be more common than you'd think and minimizing the time spent doing checkouts translates to nice wall-time wins.
This would probably be a nice win for my recently-added upload-generated-sources tasks as well: https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/upload-generated-sources/kind.yml They're doing a full checkout but they only really need some Python bits.
Most toolchain jobs have the same property: they do a full checkout for a very small number of files they need.
Comment on attachment 8900410 [details] Bug 1390700 - Sparse checkout profiles for mach and taskgraph; https://reviewboard.mozilla.org/r/171752/#review177000 Once concern I have is that we won't see problems with these profiles locally -- not until a try push or a landing. So probably being generous is good (e.g., including python/ rather than specific subpackages)
Attachment #8900410 - Flags: review?(dustin) → review+
Comment on attachment 8900411 [details] Bug 1390700 - Support sparse checkouts in run-task; https://reviewboard.mozilla.org/r/171754/#review177006 This commit should probably be before the previous.. ::: taskcluster/docs/caches.rst:60 (Diff revision 1) > This cache name pattern is managed by ``run-task`` and must only be > used with ``run-task``. > > +``level-{{level}}-checkouts-sparse-{{hash}}`` > + This is like the above but is used when the checkout is sparse (contains > + a subset of files). Isn't the cache requirements file enough to handle this difference?
Attachment #8900411 - Flags: review?(dustin) → review+
Comment on attachment 8900411 [details] Bug 1390700 - Support sparse checkouts in run-task; https://reviewboard.mozilla.org/r/171754/#review177006 > Isn't the cache requirements file enough to handle this difference? The cache requirements file locks out incompatible tasks. It is a check on us not mixing and matching different cache "flavors" when we shouldn't be. We arguably should have a cache requirement for sparse checkouts. I may do that as a follow-up...
Comment on attachment 8900410 [details] Bug 1390700 - Sparse checkout profiles for mach and taskgraph; https://reviewboard.mozilla.org/r/171752/#review177000 Yes, this is a big danger with sparse checkouts. Another is wildcards not finding files because they aren't in the sparse profile. The latter we can mitigate with a mozpack Finder that is sparse aware. I may have a go at that... As for not uncovering problems locally, I would like end-users to start using sparse checkouts. So we'll see some coverage once that occurs. But that's a ways off. There's no way around it: sparse checkouts introduce a whole new class of bugs due to "missing" files :/
build/sparse-profiles/taskgraph has been bitrotted by bug 1391744, and rebased as-is, I'd expect the decision task to be busted.
Comment on attachment 8900797 [details] Bug 1390700 - Add hash for taskcluster/decision:0.1.10; https://reviewboard.mozilla.org/r/172236/#review177508 versions in comments ++
Attachment #8900797 - Flags: review?(aki) → review+
Attachment #8900412 - Flags: review?(dustin) → review+
Attachment #8900795 - Flags: review?(dustin) → review+
Pushed by gszorc@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/ade7021e45b5 Fix some error messages in run-task; r=dustin https://hg.mozilla.org/integration/autoland/rev/7f873ef16fba Support sparse checkouts in run-task; r=dustin https://hg.mozilla.org/integration/autoland/rev/d243323c8686 Sparse checkout profiles for mach and taskgraph; r=dustin https://hg.mozilla.org/integration/autoland/rev/9923fffd4f64 Use sparse checkouts for decision task; r=dustin https://hg.mozilla.org/integration/autoland/rev/0516a2ab2945 Documentation for sparse checkouts; r=me
Comment on attachment 8900410 [details] Bug 1390700 - Sparse checkout profiles for mach and taskgraph; https://reviewboard.mozilla.org/r/171752/#review177682 This looks great! ::: build/sparse-profiles/mach:8 (Diff revision 2) > +path:build/autoconf/config.guess > +# Used for bootstrapping the mach driver. > +path:build/mach_bootstrap.py > +path:build/virtualenv_packages.txt > +path:mach > +# Various dependencies. There is room to trim fat, especially in I don't know that I'd bother trying to trim this down--it seems like it's more likely to cause us headaches down the road of implementing commands assuming the full virtualenv is there and then failing.
Attachment #8900410 - Flags: review?(ted) → review+
Had to backout the change to .taskcluster.yml because of this weird failure on some *Windows* builds: 2017-08-25T04:27:54 INFO - Found a OMYU_kr8R3qr7j9FD12T2w fuzzy match with eM27zVtYTXCuQppyOjoT-w ... 2017-08-25T04:27:54 INFO - Verifying the signing:build XQdrW-FuRKex5m9yKKbQZw task definition is part of the signing:decision L4bFr_PJRqavWHhGGogO4w task graph... 2017-08-25T04:27:54 INFO - Found XQdrW-FuRKex5m9yKKbQZw in the graph; it's a match 2017-08-25T04:27:54 INFO - Verifying signing:decision L4bFr_PJRqavWHhGGogO4w command... 2017-08-25T04:27:54 CRITICAL - signing:decision L4bFr_PJRqavWHhGGogO4w illegal option --sparse-profile=build/sparse-profiles/taskgraph in the command! 2017-08-25T04:27:54 CRITICAL - Chain of Trust verification error! Traceback (most recent call last): File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1311, in verify_chain_of_trust task_count = await verify_task_types(chain) File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 1098, in verify_task_types await valid_task_types[task_type](chain, obj) File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 943, in verify_decision_task verify_firefox_decision_command(link) File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 908, in verify_firefox_decision_command raise_on_errors(errors) File "/builds/scriptworker/lib/python3.5/site-packages/scriptworker/cot/verify.py", line 219, in raise_on_errors raise CoTError("\n".join(errors)) I have no clue how to fix this. aki?
Flags: needinfo?(aki)
Keywords: leave-open
Backout by gszorc@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/874e11b5b24e Backed out changeset 9923fffd4f64 for somehow breaking chain of trust on some Windows builds
Tl;dr: we probably need an `if item.startswith('--sparse-profile='): continue` here [1], followed by a new scriptworker release. Longer answer: This [2] is the ugliness I currently use to eyeball a decision task to make sure it *looks* like a valid decision task. Combined with known workerTypes and allowlisted docker images, it restricts the damage someone could do should they have the desire + scopes to create a malicious release graph complete with signing and update-submitting. Bug 1328719 should let us get rid of this function, and instead rebuild the decision task definition using json-e. It will likely allow us to get rid of the docker image allowlists as well. I haven't been able to prioritize time for it, but since bug 1393277 would make a lot of releasetask migration work easier and more elegant, I may be able to carve out time for cot v2. [1] https://github.com/mozilla-releng/scriptworker/blob/b353ce36ff38cdb52d95a3de5d0967bcc8e448f0/scriptworker/cot/verify.py#L889-L890 [2] https://github.com/mozilla-releng/scriptworker/blob/b353ce36ff38cdb52d95a3de5d0967bcc8e448f0/scriptworker/cot/verify.py#L836-L908
Flags: needinfo?(aki)
(In reply to Aki Sasaki [:aki] from comment #28) > Bug 1328719 should let us get rid of this function * Bug 1328719 should allow us to get rid of this function for *on-push* decision tasks. Release task graphs still require us to be able to parameterize certain things like buildnumber; until we have parameterized hooks, we may have to support submitting and cot-verifying non-json-e decision task definitions.
Removing the option from the payload and baking it into the decision image would improve verifiability, as would simplifying the entire commandline.
Pushed out scriptworker 5.0.1: https://hg.mozilla.org/build/puppet/rev/9f9c4770ded2 . This should take ~1hr to propagate.
Pushed by gszorc@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bad2c7bdb5c7 Use sparse checkouts for decision task; r=dustin
Keywords: leave-open
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Blocks: 1395752
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: