Closed Bug 1436037 Opened 2 years ago Closed 8 months ago

Support generic-worker in run-task

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set

Tracking

(firefox-esr60 fixed, firefox66 fixed)

RESOLVED FIXED
Tracking Status
firefox-esr60 --- fixed
firefox66 --- fixed

People

(Reporter: ahal, Assigned: ahal)

References

(Depends on 1 open bug, Blocks 4 open bugs)

Details

Attachments

(4 files, 2 obsolete files)

The end goal here is to run source-test tasks on OSX and Windows (but OSX is more important). So if I get anything wrong, please modify the bug as appropriate.

Source-test tasks are based on the run_task infrastructure. Currently we have targets for 'docker-worker' and 'native-engine', but no support for 'generic-worker' (which is the implementation used to run OSX and Windows tests).

Despite the fact that this has been an issue for over a year and I'm only a filing a bug now (oops), it should be high prioritiy as we don't have any place to run python-test tasks on OSX (since OSX builds are cross-compiled). We still run python-tests as part of the build on Windows so it's not as important. Though being able to pull the tests out of the build tasks would sure be nice.

Around ~7 months ago, I started a patch but hit some kind of roadblock or ran out of time (I don't remember anymore). I'll push what I had and maybe that will be a good starting point.
Comment on attachment 8948715 [details]
Bug 1436037 - WIP Support generic-worker with run-task

https://reviewboard.mozilla.org/r/218106/#review223938


Code analysis found 2 defects in this patch:
 - 2 defects found by mozlint

You can run this analysis locally with:
 - `./mach lint path/to/file` (JS/Python)


If you see a problem in this automated review, please report it here: http://bit.ly/2y9N9Vx


::: taskcluster/taskgraph/transforms/job/run_task.py:132
(Diff revision 1)
> +
> +    if run.get('cache-dotcache'):
> +        raise Exception("No cache support on generic-worker; can't use cache-dotcache")
> +
> +    run_command = run['command']
> +    if isinstance(run_command, basestring):

Error: Undefined name 'basestring' [flake8: F821]

::: taskcluster/taskgraph/transforms/task.py:358
(Diff revision 1)
>          Optional('retry-exit-status'): Any(
>              int,
>              [int],
>          ),
> +        Optional('retry-exit-status'): int,
> +        Optional('context'): basestring,

Error: Undefined name 'basestring' [flake8: F821]
Comment on attachment 8948715 [details]
Bug 1436037 - WIP Support generic-worker with run-task

https://reviewboard.mozilla.org/r/218106/#review223948


Code analysis found 2 defects in this patch:
 - 2 defects found by mozlint

You can run this analysis locally with:
 - `./mach lint path/to/file` (JS/Python)


If you see a problem in this automated review, please report it here: http://bit.ly/2y9N9Vx


::: taskcluster/taskgraph/transforms/job/run_task.py:132
(Diff revision 2)
> +
> +    if run.get('cache-dotcache'):
> +        raise Exception("No cache support on generic-worker; can't use cache-dotcache")
> +
> +    run_command = run['command']
> +    if isinstance(run_command, basestring):

Error: Undefined name 'basestring' [flake8: F821]

::: taskcluster/taskgraph/transforms/task.py:357
(Diff revision 2)
>          # the exit status code(s) that indicates the task should be retried
>          Optional('retry-exit-status'): Any(
>              int,
>              [int],
>          ),
> +        Optional('context'): basestring,

Error: Undefined name 'basestring' [flake8: F821]
Hm, this is going to get annoying.
I think the next step in my patch was to actually make the 'generic_worker_run_task' function set things up properly for generic-worker (I'm not sure what this entails). We might also need modifications to the run-task script itself, especially if we start supporting Windows.
I'm still poking around at this. Not sure if I'll end up getting this resolved, but don't want anyone else to pick this up and duplicate efforts in the meantime.
Assignee: nobody → ahalberstadt
Status: NEW → ASSIGNED
Thanks Andrew! I will concentrate on getting the mozprocess unit tests running on Windows in the meantime.
I've been making some good progress, but might need some help now. The latest push gets the task running, and am currently working through issues in run-task. The latest error happens because 'hg' is not installed on the generic-worker (I assume we'll also have to figure out how to get 'robustcheckout' installed):
https://public-artifacts.taskcluster.net/EL4hZGyqRLqHnuFZbW3-ZA/0/public/logs/live_backing.log

I'm told this will involve a change to https://github.com/mozilla-releng/OpenCloudConfig , though I don't see any reference to OSX in there.
(p.s, looks like my in-tree fix to get the reviewbot using python2 worked \o/)
OCC is for Windows only.  I thought that's what you meant when you said "generic-worker" (we wouldn't install hg in the generic worker, obviously -- it'd be installed on the host..)

Macs are configured with https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain.  The old mac builders had hg and robustcheckout installed, so doing so for the testers shouldn't be too hard.
Depends on: 1436767
Product: TaskCluster → Firefox Build System
I have a desire to make this work as well. But I'll be aiming for Windows rather than macOS. Same difference (more or less).

Since it looks like there is no activity on this bug, I hope you don't mind me stealing it.
Assignee: ahalberstadt → gps
Blocks: 1350956
Please do! Sorry it got stalled out.
Depends on: 1459737
Blocks: 1464043
Unassigning myself since I'm not working on this right this moment and it looks like ahal has a new need for this.
Assignee: gps → nobody
Status: ASSIGNED → NEW
For the record this isn't blocking my immediate needs and I don't have plans to start working on it again anytime soon. I will be working on bug 1465181 however, which will likely have a little bit of overlap with this one.
See Also: → 1465181
I did look into this a little bit more for OSX, but run-task now requires python 3 and the generic-worker OSX hosts don't have that installed.
Depends on: 1466535
Duplicate of this bug: 1478364
(In reply to Andrew Halberstadt [:ahal] from comment #0)
> The end goal here is to run source-test tasks on OSX and Windows (but OSX is
> more important).

Note this is needed for linux too now that we are migrating these source-test tasks from taskcluster-worker to generic-worker.

I figure it makes sense to tackle all three platforms (windows, linux, macOS) at once rather than split into separate bugs, as the code to generate the task definition should be generic (with the noted exception that the format of task.payload.command is platform-specific).
See Also: → 1478364
See Also: 1478364
Comment on attachment 8948715 [details]
Bug 1436037 - WIP Support generic-worker with run-task

https://reviewboard.mozilla.org/r/218106/#review266358

This looks like a great start, thanks!

::: taskcluster/taskgraph/transforms/job/common.py:84
(Diff revision 3)
>      reserved for ``run-task`` tasks.
>      """
>      level = config.params['level']
>  
> -    # native-engine does not support caches (yet), so we just do a full clone
> -    # every time :(
> +    # native-engine and generic-worker do not support caches (yet), so we just
> +    # do a full clone every time :(

generic-worker does support caches, see "Writable Directory Cache" in https://docs.taskcluster.net/docs/reference/workers/generic-worker/docs/payload

::: taskcluster/taskgraph/transforms/job/mach.py:29
(Diff revision 3)
>  
>  
>  @run_job_using("docker-worker", "mach", schema=mach_schema, defaults={'comm-checkout': False})
>  @run_job_using("native-engine", "mach", schema=mach_schema, defaults={'comm-checkout': False})
> +@run_job_using("generic-worker", "mach", schema=mach_schema, defaults={'comm-checkout': False})
>  def docker_worker_mach(config, job, taskdesc):

should this function be renamed?

::: taskcluster/taskgraph/transforms/job/run_task.py:121
(Diff revision 3)
> +    worker = taskdesc['worker'] = job['worker']
> +    command = ['./run-task']
> +    common_setup(config, job, taskdesc, command)
> +
> +    if run.get('cache-dotcache'):
> +        raise Exception("No cache support on generic-worker; can't use cache-dotcache")

Caches are supported - see https://docs.taskcluster.net/docs/reference/workers/generic-worker/docs/payload

::: tools/tryselect/tasks.py:72
(Diff revision 3)
>          params.check()
>      except ParameterMismatch as e:
>          print(PARAMETER_MISMATCH.format(e.args[0]))
>          sys.exit(1)
>  
> -    taskgraph.fast = True
> +    taskgraph.fast = False

What does this boolean change do? Could you add a comment? Thanks!
I have patch series at https://hg.mozilla.org/users/gszorc_mozilla.com/firefox/rev/2d16aac7 and https://hg.mozilla.org/users/gszorc_mozilla.com/firefox/rev/a5484c1e that may be useful in implementing this functionality. I have no immediate plans to resume that work.
Comment on attachment 8948715 [details]
Bug 1436037 - WIP Support generic-worker with run-task

https://reviewboard.mozilla.org/r/218106/#review266358

> What does this boolean change do? Could you add a comment? Thanks!

Ah, this was just to make things easier for me to debug with |mach try| (we skip schema validation in `fast` mode which causes strange errors). This hunk should be reverted before landing.
Blocks: 1500104
This might have just been fixed as part of bug 1474570. I'll try and use the work in there to stand up some OSX python-test tasks and see if I run into any problems. If not, we can resolve this.
Depends on: 1474570
I have patches that enable Windows generic-worker tasks to work with run-task:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3b3149fec49ce3e34add364dbeba94dd8ed6ba3b

However I created a bit of collateral damage around toolchains and caches. Reverting the change that caused this would be easy, but I have a feeling the existing behaviour is wrong and want to get some feedback before continuing.

Because of support_vcs_checkout, we mount a hardcoded cache at /builds/worker/checkouts:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/common.py#87

This is where most tasks checkout Gecko. However some tasks (I think just toolchains) checkout Gecko to /builds/worker/workspace/build instead:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/toolchain.py#135

I believe this mismatch between cache and checkout dir is a bug, but just want to verify it isn't expected. Assuming it is a bug, is there any reason not to change toolchains to use /builds/worker/checkouts like everything else?
Flags: needinfo?(gps)
(In reply to Andrew Halberstadt [:ahal] from comment #26)
> Because of support_vcs_checkout, we mount a hardcoded cache at
> /builds/worker/checkouts:
> https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/
> transforms/job/common.py#87
> 
> This is where most tasks checkout Gecko. However some tasks (I think just
> toolchains) checkout Gecko to /builds/worker/workspace/build instead:

All builds also use that directory (among others):
https://searchfox.org/mozilla-central/search?q=%7Bworkdir%7D%2Fworkspace%2Fbuild&path=taskcluster
and that that directory is also cached:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/common.py#22-33

As I understand it, the original intention was that things in /builds/worker/checkouts were supposed to not modify the checkout directory, where as ones at /builds/worker/workspace were allowed to. Given that we always clobber the checkout anyway, perhaps this distinction isn't valuable anymore.

I seem to recall :gps expressing concern about having the hg-share directory (/builds/worker/checkouts/hg-share) on a different filesystem than the checkout, which occurs in the /builds/worker/workspace case, so it might be worth changing things on that basis.
Ok I'll keep the needinfo to let gps chime in if he wants, but sounds like if there is work to do here it would be better left to a follow-up anyway. I'll refactor my patch to try and preserve the existing behaviour.
Also looks like toolchains never call that method so aren't using the workspace cache.
Also also they pass in sparse=True:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/toolchain.py#132

This value only gets used to build the checkouts cache, so that sort of implies the intent is that they *are* supposed to be using that cache.
This enables Windows generic-worker based tasks to use the run-task script.

MozReview-Commit-ID: C07FANaYzf7
The following python-test paths are being moved out of 'make check' and into their own task:
- python/mozlint
- testing/mozbase
- tools/lint

The following python-test paths previously did not run on Windows:
- python/mozterm
- testing/marionette
- testing/raptor
- tools/tryselect

MozReview-Commit-ID: C07FANaYzf7

Depends on D10758
Assignee: nobody → ahal
Status: NEW → ASSIGNED
Depends on: 1503756
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e072757bf691
[taskgraph] Support Windows generic-worker with run-task, r=gps
https://hg.mozilla.org/integration/autoland/rev/914a7a899dd1
[python] Create Windows python-test tasks, r=gps
Have an OSX patch nearly ready to go, going to re-use this bug.
Flags: needinfo?(gps)
Keywords: leave-open
Also, this is blocked by bug 1503756 relanding.
(In reply to Mike Hommey [:glandium] from comment #37)
> Also, this is blocked by bug 1503756 relanding.

Sorry, forgot about that!
Flags: needinfo?(ahal)
This is relevant to my interests (the windows support more urgently, but also macOS). If this is stalled due to lack of developer time let me know and I can help out.
Blocks: 1508828
The Windows patch is finished, reviewed and ready to land (and I have the OSX patch working locally, but haven't polished it up for review yet). Unfortunately it's blocked on bug 1503756.

Basically there's a bug in the docker image generation such that when an image is re-built, a bunch of system changes get pulled in. In that bug it was discovered that some system changes had caused a mochitest to start failing as well as cause the emulator on Android to fail at startup. So any modification to a file that would cause an image re-build (like run-task in this case), starts failing tests.

So one of two things need to happen:
1) Fix our image generation to not pull in random changes (ideal)
2) Figure out why the emulator stops working with those changes (band-aid, but at least unblocks this)

There is *a lot* of stuff blocked on that other bug and it looks like :jmaher is trying to ring some bells. Feel free to chime in over there to further bump up the priority.
Thanks for the quick summary! I'll see if I can make any progress on that bug as my background task.
This is now unblocked from landing. However, the patch appears to be bit rotted with merge conflicts. Could you please rebase?
Flags: needinfo?(ahal)
Thanks for unblocking this!

Rebased and try run in progress here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bb20c8898ff855a6eced924e433224412592c5b4
Flags: needinfo?(ahal)
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/df5747b8931b
[taskgraph] Support Windows generic-worker with run-task, r=gps
https://hg.mozilla.org/integration/autoland/rev/c88d2cb951ca
[python] Create Windows python-test tasks, r=gps
This is staying open for OSX support, which I had working awhile back. Just need to rebase, test and get it ready for review.
Depends on: 1507488
This isn't a hard blocker, but will make it easier and there is a patch uploaded. So might as well wait.
Depends on: 1512676
This enables OSX generic-worker based tasks to use the run-task script.
Attachment #8948715 - Attachment is obsolete: true
I broke the jsshell-bench tasks, will need to investigate further tomorrow.
Attachment #9032260 - Attachment description: Bug 1436037 - [taskgraph] Support OSX generic-worker in run-task, r?gps → Bug 1436037 - [taskgraph] Support OSX generic-worker in run-task, r?Callek
Keywords: leave-open
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6ffe1e52008f
[taskgraph] Support OSX generic-worker in run-task, r=Callek
https://hg.mozilla.org/integration/autoland/rev/0b74d496b797
[ci] Run mozbase and mozlint python-test tasks on OSX, r=jmaher
https://hg.mozilla.org/mozilla-central/rev/6ffe1e52008f
https://hg.mozilla.org/mozilla-central/rev/0b74d496b797
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Blocks: 1525373
Blocks: 1528438
Blocks: 1543894
Attachment #9047285 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.