Closed Bug 1391789 Opened 2 years ago Closed 2 years ago

More robust cache usage and validation

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla57

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(7 files, 1 obsolete file)

nalexander ran into some problems with permissions in caches.

Our current strategy for managing caches is a bit brittle. I have some patches that will improve matters.
I forgot to mention in any commit messages that the strategy I used is arguably better implemented at the TC worker layer. We could do things like give the worker paths in the Docker image to hash along with an optional blob to determine the cache key. Or maybe there could be some "protocol" where a command is executed in the Docker contain before the main task to determine the unique cache "namespace". This would be nice to have and would eliminate things like the UID/GID mismatch, which we only detect today *after* a cache is mounted. But we can mostly solve the problem in run-task today. If nothing else, we'll prove out the concept to justify it as a future TC platform/worker feature.
Comment on attachment 8899041 [details]
Bug 1391789 - Improve cache coherence via run-task integration;

https://reviewboard.mozilla.org/r/170364/#review175588

::: taskcluster/taskgraph/transforms/task.py:669
(Diff revision 1)
> +        # run-task knows how to validate caches.
> +        #
> +        # To help ensure new run-task features and bug fixes don't interfere
> +        # with existing caches, we seed the hash of run-task into cache names.
> +        # So, any time run-task changes, we should get a fresh set of caches.
> +        # This means run-task can make changes to cache interaction at any time
> +        # without regards for backwards or future compatibility.
> +
> +        run_task = payload.get('command', [''])[0].endswith('run-task')

A more generic implementation of this feature would be "get cache namespace for a task." You could imagine feeding a number of things into the function that resolves that value:

* File content from repo (as we did here)
* File content from the actual Docker image (not available to Taskgraph, but if this were executed on the worker at task time it would be)
* Data from Docker container (such as the numeric UID/GID that a username/group resolves to)
* A TC secret (so you could force usage of new caches at any time, independent of version control changes)

And of course, there's no reason this implementation has to be tied to `run-task`. I'm not opposed to abstracting it now and having it only return a value for `run-task` tasks. Let me know.
Comment on attachment 8899037 [details]
Bug 1391789 - Consolidate tooltool modifications to shared function;

https://reviewboard.mozilla.org/r/170356/#review175830

::: commit-message-8febd:1
(Diff revision 1)
> +Bug 1391789 - Consolidate tooltool modifications to shared funciton; r?dustin

function
Attachment #8899037 - Flags: review?(dustin) → review+
Comment on attachment 8899038 [details]
Bug 1391789 - Make tooltool cache level dependent;

https://reviewboard.mozilla.org/r/170358/#review175834

Most stuff using tooltool is on a different worker per level anyway, so this shouldn't result in much duplication of objects.
Attachment #8899038 - Flags: review?(dustin) → review+
Comment on attachment 8899039 [details]
Bug 1391789 - Make hash_path() a "public" API;

https://reviewboard.mozilla.org/r/170360/#review175838

::: taskcluster/taskgraph/util/hash.py:13
(Diff revision 1)
>  import mozpack.path as mozpath
>  import hashlib
>  
>  
>  @memoize
> -def _hash_path(path):
> +def hash_path(path):

A docstring would be good, then..
Attachment #8899039 - Flags: review?(dustin) → review+
Comment on attachment 8899040 [details]
Bug 1391789 - Extract permission setting to own functions;

https://reviewboard.mozilla.org/r/170362/#review175840

::: commit-message-c666e:5
(Diff revision 1)
> +Bug 1391789 - Extract permission setting to own functions; r?dustin
> +
> +I must have been in a closure mood when I wrote this code. The
> +main function is getting a bit heavyweight. So let's extract
> +these closures to make things less dense.

For what it's worth I rather like that style of Python
Attachment #8899040 - Flags: review?(dustin) → review+
Comment on attachment 8899041 [details]
Bug 1391789 - Improve cache coherence via run-task integration;

https://reviewboard.mozilla.org/r/170364/#review175842

I'll admit I think this is a big hammer.  But, it's relatively confined and well-implemented, and maybe I'm wrong about the relative hammer/problem size :)

::: commit-message-eb137:18
(Diff revision 1)
> +that some things require fresh caches. When mistakes are made, tasks
> +break intermittently due to cache wonkiness.
> +
> +One area where we get into trouble is with UID and GID mismatch.
> +Task A will use a Docker image where our standard "worker" user/group
> +is UID/GID 1000:1000. Then Task B will use UID/GID 500:500. If they

..worth noting this is usually due to Ubuntu vs. RHEL.

::: commit-message-eb137:38
(Diff revision 1)
> +
> +1) Cache names tied to run-task content
> +2) Cache validation in run-task
> +
> +Taskgraph now detects when a task is using caches with run-task. When
> +both conditions are true, the cache name is adjusted to contain a

both conditions = using caches & using run-task?

::: taskcluster/docker/recipes/run-task:310
(Diff revision 1)
> +
> +        # The cache has content and we have a requirements file. Validate
> +        # requirements alignment.
> +        elif os.path.exists(requires_path):
> +            with open(requires_path, 'rb') as fh:
> +                wanted_requirements = set(fh.read().splitlines())

you might want to sort this?

::: taskcluster/docker/recipes/run-task:319
(Diff revision 1)
> +                    print('requirements for populated cache %s differ from '
> +                          'this task' % cache)
> +                    print('cache requirements: %s' % ' '.join(sorted(
> +                        wanted_requirements)))
> +                    print('our requirements:   %s' % ' '.join(sorted(
> +                        our_requirements)))

consider using difflib

::: taskcluster/docs/caches.rst:12
(Diff revision 1)
>  document them and their appropriate use.
>  
> +Caches are essentially isolated filesystems that are persisted between
> +tasks. For example, if 2 tasks run on a worker - one after the other -
> +and both tasks request the same cache, the subsequent task will be
> +able to see files in the cache that were created by the first task.

It's worth noting in the same breath that caches are not shared between concurrently-running tasks.

::: taskcluster/docs/caches.rst:86
(Diff revision 1)
>     dependent on the task.
>  
>  Other
> -=====
> +-----
>  
> -``level-{{level}}-tooltool-cache
> +``level-{{level}}-tooltool-cache-{{hash}}

this could probably omit `-cache` just to safe some filename length

::: taskcluster/taskgraph/transforms/task.py:33
(Diff revision 1)
>  
>  
> +RUN_TASK = os.path.join(GECKO, 'taskcluster', 'docker', 'recipes', 'run-task')
> +
> +
> +def _run_task_suffix():

@memoize

::: taskcluster/taskgraph/transforms/task.py:669
(Diff revision 1)
> +        # run-task knows how to validate caches.
> +        #
> +        # To help ensure new run-task features and bug fixes don't interfere
> +        # with existing caches, we seed the hash of run-task into cache names.
> +        # So, any time run-task changes, we should get a fresh set of caches.
> +        # This means run-task can make changes to cache interaction at any time
> +        # without regards for backwards or future compatibility.
> +
> +        run_task = payload.get('command', [''])[0].endswith('run-task')

I feel like this solution is already highly generic.  Aside from the uid/gid and potentially a version associated with robustcheckout, I don't see any additional uses for this feature.

I think the major risk that this system does not address is corrupt caches.  Robustcheckout and tooltool are both well-designed with regard to robustness, but we've seen issues now and then.
Attachment #8899041 - Flags: review?(dustin) → review+
Comment on attachment 8899042 [details]
Bug 1391789 - Validate certain caches are used with run-task;

https://reviewboard.mozilla.org/r/170366/#review175858
Attachment #8899042 - Flags: review?(dustin) → review+
Attachment #8899043 - Flags: review?(dustin) → review+
Comment on attachment 8899044 [details]
Bug 1391789 - Add UID and GID to cache parameters;

https://reviewboard.mozilla.org/r/170370/#review175864
Attachment #8899044 - Flags: review?(dustin) → review+
Comment on attachment 8899042 [details]
Bug 1391789 - Validate certain caches are used with run-task;

https://reviewboard.mozilla.org/r/170366/#review175858
Comment on attachment 8899041 [details]
Bug 1391789 - Improve cache coherence via run-task integration;

https://reviewboard.mozilla.org/r/170364/#review175842

> you might want to sort this?

Eh? This is a set, not a list and all subsequent operations treat it as such. We do use a sorted() around output, so that behavior is deterministic.

> consider using difflib

It's probaby easier to use set algebra, which this does.
Attachment #8899044 - Attachment is obsolete: true
Blocks: 1391476
We're sorry, Autoland could not rebase your commits for you automatically. Please manually rebase your commits and try again.

hg error in cmd: hg rebase -s fa708f732f38 -d 26914436ea95: rebasing 414930:fa708f732f38 "Bug 1391789 - Consolidate tooltool modifications to shared function; r=dustin"
merging taskcluster/taskgraph/transforms/job/common.py
merging taskcluster/taskgraph/transforms/job/mozharness.py
merging taskcluster/taskgraph/transforms/job/spidermonkey.py
warning: conflicts while merging taskcluster/taskgraph/transforms/job/spidermonkey.py! (edit, then use 'hg resolve --mark')
unresolved conflicts (see hg resolve, then hg rebase --continue)
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/265873cf1388
Consolidate tooltool modifications to shared function; r=dustin
https://hg.mozilla.org/integration/autoland/rev/558568f61ccc
Make tooltool cache level dependent; r=dustin
https://hg.mozilla.org/integration/autoland/rev/9c47141cd9a1
Make hash_path() a "public" API; r=dustin
https://hg.mozilla.org/integration/autoland/rev/6225f34dfee2
Extract permission setting to own functions; r=dustin
https://hg.mozilla.org/integration/autoland/rev/2b282e22f50e
Improve cache coherence via run-task integration; r=dustin
https://hg.mozilla.org/integration/autoland/rev/30c99e1d6eac
Validate certain caches are used with run-task; r=dustin
https://hg.mozilla.org/integration/autoland/rev/aa0700e8a3c8
Stop versioning version control cache; r=dustin
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e0a80a925217
Set relengapi-proxy on correct object; r=bustage
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.