Migrate linux talos workers from taskcluster-worker to generic-worker

Status

RESOLVED FIXED
Type: task
Opened: 10 months ago
Last modified: a month ago

People

(Reporter: pmoore, Assigned: dragrom)

Tracking

(Depends on 1 bug, Blocks 1 bug)

Production

Details

Attachments

(8 attachments, 10 obsolete attachments)

55 bytes, patch: dhouse review+, pmoore review+, dividehex review-, dragrom checked-in+
3.21 KB, patch: pmoore review+, dragrom checked-in+
55 bytes, text/x-github-pull-request: dhouse review+, dividehex review-, dragrom checked-in+
3.21 KB, patch: pmoore review+, dragrom checked-in+
55 bytes, text/x-github-pull-request: dhouse review+, dragrom checked-in+
13.83 KB, patch: pmoore review+, dragrom checked-in+
55 bytes, text/x-github-pull-request: dragrom checked-in+
55 bytes, text/x-github-pull-request: dragrom checked-in+
(Reporter)

Description

10 months ago
In gecko CI, taskcluster-worker is currently only used by linux talos. Since our primary maintainer of taskcluster-worker has left Mozilla, the taskcluster team has decided to focus its effort on generic-worker, and to that end we intend to discontinue support for taskcluster-worker, supporting a migration path to generic-worker from existing taskcluster-worker tasks.

A generic-worker puppet module already exists and is used for production mac workers, but has not been tested on linux.

Also the task payload varies marginally between generic-worker and taskcluster-worker, so a change to the task generation in gecko will be needed to mark these tasks as having a generic-worker payload format rather than a taskcluster-worker native engine payload format.

It would be best for us to first migrate staging, and then migrate production.

The Taskcluster team is available and keen to help with the migration effort.
Duplicate of this bug: 1384579
(Assignee)

Updated

10 months ago
Assignee: relops → dcrisan
Status: NEW → ASSIGNED
(Assignee)

Updated

10 months ago
Depends on: 1474568
(Assignee)

Comment 2

10 months ago
Added the generic-worker, taskcluster-proxy and quarantine-worker Linux packages to the repository:

-rw-r--r-- 1 puppetsync puppetsync 13808052 2018-07-13 05:59 generic-worker-v10.8.4-linux-amd64
-rw-r--r-- 1 puppetsync puppetsync  6785240 2018-07-13 05:59 quarantine-worker-v1.0.0-linux-amd64
-rw-r--r-- 1 puppetsync puppetsync  7549872 2018-07-13 05:59 taskcluster-proxy-v4.1.1-linux-amd64
No longer depends on: 1474568
(Assignee)

Comment 3

10 months ago
Added Ubuntu 16 capabilities to the generic-worker module.
Modified generic-worker to automatically create the configuration file for Linux and OS X workers
Added t-linux64-ms-280.test.releng.mdc1.mozilla.com to the staging pool
Enabled the http_proxy module
Changed bugzilla-utils and run-generic-worker to work on both OS X and Linux
Made control_bug generic, to work on both OSes
Attachment #8992963 - Flags: review?(pmoore)
Attachment #8992963 - Flags: review?(jwatkins)
Attachment #8992963 - Flags: review?(dhouse)

Updated

10 months ago
Attachment #8992963 - Flags: review?(dhouse) → review+
Comment on attachment 8992963 [details] [diff] [review]
Migrate linux talos workers from task cluster to generic worker

See PR
Attachment #8992963 - Flags: review?(jwatkins) → review-
(Assignee)

Comment 5

10 months ago
I made the changes requested by the reviewers.
(Assignee)

Comment 6

10 months ago
Attachment #8994778 - Flags: review?(pmoore)
(Reporter)

Comment 7

10 months ago
Comment on attachment 8994778 [details] [diff] [review]
run_generic_worker_tasks_on_linux.patch

Review of attachment 8994778 [details] [diff] [review]:
-----------------------------------------------------------------

I believe you need to change this line from 'native-engine' to 'generic-worker':

https://dxr.mozilla.org/mozilla-central/rev/085cdfb90903d4985f0de1dc7786522d9fb45596/taskcluster/taskgraph/util/workertypes.py#39

My mistake when I suggested changing the other lines - apologies.

::: taskcluster/taskgraph/transforms/job/mach.py
@@ -26,5 @@
>  })
>  
>  
>  @run_job_using("docker-worker", "mach", schema=mach_schema, defaults={'comm-checkout': False})
> -@run_job_using("native-engine", "mach", schema=mach_schema, defaults={'comm-checkout': False})

My mistake - this line can stay as it is.

::: taskcluster/taskgraph/transforms/job/mozharness_test.py
@@ -305,5 @@
>              mh_command_task_ref
>          ]
>  
>  
> -@run_job_using('native-engine', 'mozharness-test', schema=mozharness_test_run_schema)

My mistake - this line can stay as it is.

::: taskcluster/taskgraph/transforms/job/run_task.py
@@ -106,5 @@
>      command.extend(run_command)
>      worker['command'] = command
>  
>  
> -@run_job_using("native-engine", "run-task", schema=run_task_schema, defaults=defaults)

My mistake - this line can stay as it is.
Attachment #8994778 - Flags: review?(pmoore) → review-
(Reporter)

Comment 8

10 months ago
Comment on attachment 8992963 [details] [diff] [review]
Migrate linux talos workers from task cluster to generic worker

Review of attachment 8992963 [details] [diff] [review]:
-----------------------------------------------------------------

Added comments in PR.
Attachment #8992963 - Attachment is patch: true
Attachment #8992963 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8992963 - Flags: review?(pmoore) → review+
(Assignee)

Updated

10 months ago
Depends on: 1478364
(Reporter)

Comment 9

10 months ago
See https://bugzilla.mozilla.org/show_bug.cgi?id=1478364#c3 - the remaining steps seem to be:

1) fixing test-linux.sh to not assume task directory is home directory (see bug 1478364 for details)
2) adding the run-task implementation for generic-worker

Details in the bug.
(Assignee)

Comment 10

9 months ago
Attachment #9003525 - Flags: feedback?(pmoore)
(Assignee)

Comment 11

9 months ago
Run generic worker on linux machines
Attachment #8994778 - Attachment is obsolete: true
Attachment #9003525 - Attachment is obsolete: true
Attachment #9003525 - Flags: feedback?(pmoore)
Attachment #9004588 - Flags: review?(pmoore)
I retriggered more tasks:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d26ead1312feda1551728ee3264a9d1fc3c478a8&filter-tier=1&filter-tier=2&filter-tier=3

It will take a while for them to complete, but this will give me a full picture.
(Reporter)

Comment 16

9 months ago
Comment on attachment 9004588 [details] [diff] [review]
run_generic_worker_on_linux.patch

Review of attachment 9004588 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great!

::: taskcluster/scripts/tester/test-linux.sh
@@ +27,5 @@
>  : NEED_WINDOW_MANAGER           ${NEED_WINDOW_MANAGER:=false}
>  : NEED_PULSEAUDIO               ${NEED_PULSEAUDIO:=false}
>  : START_VNC                     ${START_VNC:=false}
>  : TASKCLUSTER_INTERACTIVE       ${TASKCLUSTER_INTERACTIVE:=false}
> +: WORKSPACE                     ${WORKSPACE:=$PWD/workspace}

nit: ${WORKSPACE:=workspace}

@@ +57,5 @@
>  if [[ -z ${MOZHARNESS_CONFIG} ]]; then fail "MOZHARNESS_CONFIG is not set"; fi
>  
>  # make sure artifact directories exist
>  mkdir -p $WORKSPACE/build/upload/logs
> +mkdir -p $PWD/artifacts/public

nit: mkdir -p artifacts/public

@@ +64,5 @@
>  cleanup() {
>      local rv=$?
>      if [[ -s $HOME/.xsession-errors ]]; then
>        # To share X issues
> +      cp $HOME/.xsession-errors $PWD/artifacts/public/xsession-errors.log

nit: cp ~/.xsession-errors artifacts/public/xsession-errors.log

@@ +123,5 @@
>      start_xvfb '1600x1200x24' 0
>  fi
>  
>  if $START_VNC; then
> +    x11vnc > $PWD/artifacts/public/x11vnc.log 2>&1 &

nit: x11vnc > artifacts/public/x11vnc.log 2>&1 &

@@ +191,5 @@
>  
>  # Run a custom mach command (this is typically used by action tasks to run
>  # harnesses in a particular way)
>  if [ "$CUSTOM_MACH_COMMAND" ]; then
> +    eval "$PWD/workspace/build/tests/mach ${CUSTOM_MACH_COMMAND}"

nit: eval "workspace/build/tests/mach ${CUSTOM_MACH_COMMAND}"

Also note, the if statement above should really be:

if [ -n "$CUSTOM_MACH_COMMAND" ]; then

I realise you didn't write the if clause, but maybe we can fix that at the same time. :)

::: taskcluster/taskgraph/transforms/job/mozharness_test.py
@@ +208,5 @@
>  
>      taskdesc['scopes'].extend(
> +        ['generic-worker:os-group:{}/{}'.format(
> +            job['worker-type'],
> +            group

Is this a whitespace change (e.g. tabs -> spaces)? I can't see what has changed here...

@@ +259,5 @@
> +        mh_command = [
> +            'python2.7',
> +            '-u',
> +            'mozharness/scripts/' + mozharness['script']
> +        ]

I think this is the same as is_macosx section - is there a reason not to combine it with macosx section?

e.g.

    if is_windows:
        mh_command = [
            'c:\\mozilla-build\\python\\python.exe',
            '-u',
            'mozharness\\scripts\\' + normpath(mozharness['script'])
        ]
    else:
        mh_command = [
            'python2.7',
            '-u',
            'mozharness/scripts/' + mozharness['script']
        ]

::: taskcluster/taskgraph/transforms/job/run_task.py
@@ +92,5 @@
>      command.extend(run_command)
>      worker['command'] = command
>  
>  
> +@run_job_using("native-engine", "run-task", schema=run_task_schema, defaults=docker_defaults)

I think nothing is using native-engine - we could probably delete all code for it in a future bug...

@@ +114,5 @@
>      command.extend(run_command)
>      worker['command'] = command
> +
> +
> +@run_job_using("generic-worker", "run-task", schema=run_task_schema, defaults=docker_defaults)

if docker_defaults contains data that is relevant for generic worker / native worker, we should probably rename it to something like worker_defaults.

@@ +122,5 @@
> +    command = ['./run-task']
> +    common_setup(config, job, taskdesc, command)
> +
> +    if run.get('cache-dotcache'):
> +        raise Exception("No cache support on generic-worker; can't use cache-dotcache")

generic-worker supports caches - what sets cache-dotcache? Is this still used? If it is used, we should implement using mounts feature[1] but if not, we should remove the code.

--

[1] https://docs.taskcluster.net/docs/reference/workers/generic-worker/docs/payload

::: taskcluster/taskgraph/transforms/task.py
@@ +1613,5 @@
> +            task['worker-type'] = find_replace_dict[task['worker-type']]
> +        yield task
> +
> +
> +@transforms.add

Please remove this added code when you land - this is only for the try push to run on beta worker types.
Attachment #9004588 - Flags: review?(pmoore)
(Reporter)

Comment 17

9 months ago
(In reply to Pete Moore [:pmoore][:pete] from comment #16)

> @@ +122,5 @@
> > +    command = ['./run-task']
> > +    common_setup(config, job, taskdesc, command)
> > +
> > +    if run.get('cache-dotcache'):
> > +        raise Exception("No cache support on generic-worker; can't use cache-dotcache")
> 
> generic-worker supports caches - what sets cache-dotcache? Is this still
> used? If it is used, we should implement using mounts feature[1] but if not,
> we should remove the code.

See https://dxr.mozilla.org/mozilla-central/rev/9c13dbdf4cc9baf98881b4e2374363587fb017b7/taskcluster/taskgraph/transforms/task.py#332-369

You just need to set cacheName and directory...
(Reporter)

Comment 18

9 months ago
> You just need to set cacheName and directory...

Sorry, "cache-name" and "directory"
(Reporter)

Comment 19

9 months ago
Dragos,

What is the strategy for rolling this out to all trees? I see a few different options:

1) We create a different worker type for generic-worker, and support jobs running in both taskcluster-worker and generic-worker for some time, until this patch has ridden the trains, and decommission the taskcluster-worker worker type when it is no longer used

2) We migrate the workers at the same time as landing the gecko patch, and we uplift the patch to all trees/support branches etc in one shot. There may be fallout from try pushes, but we accept that, and send out emails to warn users.

3) We set these tasks to tier 3 on branches still using taskcluster-worker payload format, and allow them to fail naturally, then raise tier as this rolls out. Not sure what tier these tasks are currently - if they are already tier 3 - this option seems the best.


My preference would be option 3, followed by option 1, followed by option 2.

Joel, thoughts?
Flags: needinfo?(dcrisan)
(Reporter)

Comment 20

9 months ago
(In reply to Pete Moore [:pmoore][:pete] from comment #19)
> Joel, thoughts?
I like option #1- we can uplift things to beta, etc. and allow things to migrate faster.  I am still waiting on more results from the extra jobs I scheduled/retriggered, 162 remaining- maybe another 7 hours
I like option #1- we can uplift things to beta, etc. and allow things to migrate faster.  I am still waiting on more results from the extra jobs I scheduled/retriggered.

I found a few jobs that didn't pass (which we didn't test last time).  These are newer jobs which we have added on hardware in the last month:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d26ead1312feda1551728ee3264a9d1fc3c478a8&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=196400140

the js-bench tests (6speed, webtool, ares6, sunspider) are failing with errors like:
taskcluster:error] [mounts] Could not fetch from url https://hg.mozilla.org/try/raw-file/d26ead1312feda1551728ee3264a9d1fc3c478a8/taskcluster/docker/recipes/run-task into file /home/cltbld/downloads/eORMHexVTpCZ694cMGEwQQ: (Permanent) HTTP response code 404

I don't know if that is related to the new worker, or if this is something quirky with try server.  As these jobs normally run on try, I assume this is new worker related.

All of the other failures are of no concern and are known issues or not scheduled by default (anymore)
Flags: needinfo?(jmaher)
(Reporter)

Comment 23

9 months ago
(In reply to Joel Maher ( :jmaher PTO - back Sep 4 ) (UTC-4) from comment #22)
> I like option #1- we can uplift things to beta, etc. and allow things to
> migrate faster.  I am still waiting on more results from the extra jobs I
> scheduled/retriggered.
> 
> I found a few jobs that didn't pass (which we didn't test last time).  These
> are newer jobs which we have added on hardware in the last month:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=d26ead1312feda1551728ee3264a9d1fc3c478a8&filter-
> tier=1&filter-tier=2&filter-tier=3&selectedJob=196400140
> 
> the js-bench tests (6speed, webtool, ares6, sunspider) are failing with
> errors like like:
> taskcluster:error] [mounts] Could not fetch from url
> https://hg.mozilla.org/try/raw-file/d26ead1312feda1551728ee3264a9d1fc3c478a8/
> taskcluster/docker/recipes/run-task into file
> /home/cltbld/downloads/eORMHexVTpCZ694cMGEwQQ: (Permanent) HTTP response
> code 404
> 
> I don't know if that is related to the new worker, or if this is something
> quirky with try server.  As these jobs normally run on try, I assume this is
> new worker related.
> 
> All of the other failures are of no concern and are known issues or not
> scheduled by default (anymore)

It looks like the task is looking for a file called run-task in this directory of the hg repo:

  https://hg.mozilla.org/try/raw-file/d26ead1312feda1551728ee3264a9d1fc3c478a8/taskcluster/docker/recipes

However, the correct directory appears to be this one:

  https://hg.mozilla.org/try/raw-file/d26ead1312feda1551728ee3264a9d1fc3c478a8/taskcluster/scripts

This suggests a problem with the task rather than with the worker. Joel, do you know who could help with this? Thanks.
Flags: needinfo?(jmaher)
:ahal, could you help out with this issue- what we see is that the js-shell tests are failing while trying to run on a generic worker instead of the other worker they have been using on hardware.  I know these tasks still run successfully on our integration branches, so maybe this is just figuring out a config or environment setting?
Flags: needinfo?(jmaher) → needinfo?(ahal)
Pete, could you try removing this line?
https://searchfox.org/mozilla-central/source/taskcluster/ci/source-test/jsshell.yml#24

The native-engine workers (which are associated with the hardware pool) still run out of /home/cltbld. If unspecified, that 'workdir' parameter will default to /build/worker.
Flags: needinfo?(ahal) → needinfo?(pmoore)
(or alternatively tell me how to test this out)
(Assignee)

Comment 27

9 months ago
(In reply to Pete Moore [:pmoore][:pete] from comment #23)
> (In reply to Joel Maher ( :jmaher PTO - back Sep 4 ) (UTC-4) from comment
> #22)
> > I like option #1- we can uplift things to beta, etc. and allow things to
> > migrate faster.  I am still waiting on more results from the extra jobs I
> > scheduled/retriggered.
> > 
> > I found a few jobs that didn't pass (which we didn't test last time).  These
> > are newer jobs which we have added on hardware in the last month:
> > https://treeherder.mozilla.org/#/
> > jobs?repo=try&revision=d26ead1312feda1551728ee3264a9d1fc3c478a8&filter-
> > tier=1&filter-tier=2&filter-tier=3&selectedJob=196400140
> > 
> > the js-bench tests (6speed, webtool, ares6, sunspider) are failing with
> > errors like like:
> > taskcluster:error] [mounts] Could not fetch from url
> > https://hg.mozilla.org/try/raw-file/d26ead1312feda1551728ee3264a9d1fc3c478a8/
> > taskcluster/docker/recipes/run-task into file
> > /home/cltbld/downloads/eORMHexVTpCZ694cMGEwQQ: (Permanent) HTTP response
> > code 404
> > 
> > I don't know if that is related to the new worker, or if this is something
> > quirky with try server.  As these jobs normally run on try, I assume this is
> > new worker related.
> > 
> > All of the other failures are of no concern and are known issues or not
> > scheduled by default (anymore)
> 
> It looks like the task is looking for a file called run-task in this
> directory of the hg repo:
> 
>  
> https://hg.mozilla.org/try/raw-file/d26ead1312feda1551728ee3264a9d1fc3c478a8/
> taskcluster/docker/recipes
> 
> However, the correct directory appears to be this one:
> 
>  
> https://hg.mozilla.org/try/raw-file/d26ead1312feda1551728ee3264a9d1fc3c478a8/
> taskcluster/scripts
> 
> This suggests a problem with the task rather than with the worker. Joel, do
> you know who could help with this? Thanks.

I found the issue. I'll correct it in the next patch.
(Assignee)

Comment 28

9 months ago
(In reply to Pete Moore [:pmoore][:pete] from comment #19)
> Dragos,
> 
> What is the strategy for rolling this out to all trees? I see a few
> different options:
> 
> 1) We create a different worker type for generic-worker, and support jobs
> running in both taskcluster-worker and generic-worker for some time, until
> this patch has ridden the trains, and decommission the taskcluster-worker
> worker type when it is no longer used
> 
> 2) We migrate the workers at the same time as landing the gecko patch, and
> we uplift the patch to all trees/support branches etc in one shot. There may
> be fallout from try pushes, but we accept that, and send out emails to warn
> users.
> 
> 3) We set these tasks to tier 3 on branches still using taskcluster-worker
> payload format, and allow them to fail naturally, then raise tier as this
> rolls out. Not sure what tier these tasks are currently - if they are
> already tier 3 - this option seems the best.
> 
> 
> My preference would be option 3, followed by option 1, followed by option 2.
> 
> Joel, thoughts?

I prefer option #1
Flags: needinfo?(dcrisan)
(Reporter)

Comment 29

9 months ago
Thanks guys!

Dragos, I'm also happy with option 1. So let's go with that.

Are you happy to prepare new patches with the fix you mentioned in comment 27, plus the change to worker type name?

I think you'll also need to make a puppet patch so that we have taskcluster-worker workers and generic-worker workers running in parallel (i.e. you'll need to split the current pool). My suggestion would be to leave one taskcluster-worker, and make the rest generic-worker workers. At some point when the gecko change has landed in all the trees and taskcluster-worker is no longer getting jobs, you can use the entire pool for generic-worker.

Does that sound reasonable to you, and do you have everything you need in order to progress with this?

Many thanks!
Flags: needinfo?(pmoore) → needinfo?(dcrisan)
(Reporter)

Updated

9 months ago
Blocks: 1488390
(Assignee)

Comment 30

9 months ago
Run talos jobs on generic-worker
Attachment #9006237 - Flags: review?(pmoore)
(Reporter)

Updated

9 months ago
Attachment #9004588 - Attachment is obsolete: true
(Reporter)

Comment 31

9 months ago
Comment on attachment 9006237 [details] [diff] [review]
run_talos_jobs_on_generic-worker.patch

Review of attachment 9006237 [details] [diff] [review]:
-----------------------------------------------------------------

Looks reasonable - can you make a try push for this, and run it against the beta worker type?

Also please make sure to also include the js bench tests too (see comment 22). Thanks Dragos!
Attachment #9006237 - Flags: review?(pmoore) → review+
(Reporter)

Comment 32

9 months ago
Note, if we do go for option 1 (from comment 19) we will have to perform the following steps:


Step 1) setup a worker type gecko-t-linux-talos-tw in puppet repo, which takes x% of the machines available (the other (100-x)% remain on gecko-t-linux-talos)
Step 2) push gecko patch to move from using gecko-t-linux-talos to gecko-t-linux-talos-tw
Step 3) uplift this patch around trees, adjusting pool sizes from step 1 along the way, until all trees are using gecko-t-linux-talos-tw
Step 4) monitor for gecko-t-linux-talos tasks for a couple of weeks to make sure no new tasks are being created (e.g. due to people making try pushes from old changesets etc); see the monitoring sketch after this list
Step 5) update gecko-t-linux-talos to run generic-worker via puppet patch
Step 6) make gecko patch to use generic-worker format payload using gecko-t-linux-talos worker type, and test via try push
Step 7) roll out gecko patch, and adjust pool sizes from step 1
Step 8) merge around trees until we're confident we have it everywhere it is needed, and continue to adjust pool sizes
Step 9) monitor gecko-t-linux-talos-tw for a couple of weeks to make sure no new tasks are getting scheduled (again due to try pushes based on old changesets)
Step 10) remove last remaining gecko-t-linux-talos-tw nodes in build-puppet repo (and delete unused modules/manifests etc)
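
For the monitoring steps (4 and 9 above), a spot-check along these lines could be run periodically to confirm a worker type is no longer receiving work (a sketch assuming the queue's pendingTasks endpoint; swap in whichever worker type is being monitored):

  # e.g. for step 9: is the old -tw queue still receiving tasks?
  curl -s 'https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-linux-talos-tw' \
    | grep -o '"pendingTasks": *[0-9]*'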


As you see this is quite a bit of work. This is certainly more work than option 2 from comment 19, and would take longer to implement. It might be safer (less risk and less impact to try pushes), but more work for operational staff.
(Assignee)

Comment 35

9 months ago
After discussion with pmoore, we identified the following steps to migrate from taskcluster-worker to generic-worker in production:

Step 1) setup a worker type gecko-t-linux-talos-tw in puppet repo, which takes x% of the machines available (the other (100-x)% remain on gecko-t-linux-talos)
Step 2) push gecko patch to move from using gecko-t-linux-talos to gecko-t-linux-talos-tw
Step 3) uplift this patch around trees, adjusting pool sizes from step 1 along the way, until all trees are using gecko-t-linux-talos-tw
Step 4) monitor for gecko-t-linux-talos tasks for a couple of weeks to make sure no new tasks are being created (e.g. due to people making try pushes from old changesets etc)
Step 5) update gecko-t-linux-talos to run generic-worker via puppet patch
Step 6) make gecko patch to use generic-worker format payload using gecko-t-linux-talos worker type, and test via try push
Step 7) roll out gecko patch, and adjust pool sizes from step 1
Step 8) merge around trees until we're confident we have it everywhere it is needed, and continue to adjust pool sizes
Step 9) monitor gecko-t-linux-talos-tw for a couple of weeks to make sure no new tasks are getting scheduled (again due to try pushes based on old changesets)
Step 10) remove last remaining gecko-t-linux-talos-tw nodes in build-puppet repo (and delete unused modules/manifests etc)

Kendall, Joel, are you OK with these steps? Please see also https://bugzilla.mozilla.org/show_bug.cgi?id=1474570#c32
Flags: needinfo?(klibby)
results look good, thanks for the try push.

I am fine with the proposal for the migration, it is reasonable!
Flags: needinfo?(jmaher)
If joel and pete are happy with that plan, then so am I. thanks!
Flags: needinfo?(klibby)
(Assignee)

Comment 38

9 months ago
On taskcluster we have 186 linux machines: 98 on mdc1 and 88 on mdc2
(Assignee)

Comment 39

9 months ago
Added 'releng-hardware/gecko-t-linux-talos-tw': ('native-engine', 'linux') on workertypes
Attachment #9006237 - Attachment is obsolete: true
Attachment #9007182 - Flags: review?(pmoore)
(Assignee)

Comment 40

9 months ago
Attachment #9007182 - Attachment is obsolete: true
Attachment #9007182 - Flags: review?(pmoore)
Attachment #9007217 - Flags: review?
(Assignee)

Updated

9 months ago
Attachment #9007217 - Flags: review? → review?(pmoore)
(Reporter)

Updated

8 months ago
Attachment #9007217 - Flags: review?(pmoore) → review+
(Assignee)

Updated

8 months ago
Attachment #8992963 - Flags: checked-in+
(Assignee)

Comment 41

8 months ago
Almost all mdc1 workers are now in the gecko-t-linux-talos-tw pool: 93 of 98

curl https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw/workers|grep "workerId"|wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11946  100 11946    0     0   2047      0  0:00:05  0:00:05 --:--:--  2706
93
(Assignee)

Comment 42

8 months ago
gecko tasks start using gecko-t-linux-talos-tw instead of gecko-t-linux-talos
Attachment #9009919 - Flags: review?(pmoore)
(Reporter)

Updated

8 months ago
Attachment #9009919 - Flags: review?(pmoore) → review+

Comment 43

8 months ago
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/61f90f580122
migrate talos linux tasks from worker type gecko-t-linux-talos to gecko-t-linux-talos-tw,r=pmoore

Comment 44

8 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/61f90f580122
Status: ASSIGNED → RESOLVED
Last Resolved: 8 months ago
Resolution: --- → FIXED
(Assignee)

Updated

8 months ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Updated

8 months ago
Whiteboard: [checkin-needed-beta]
Whiteboard: [checkin-needed-beta]
(Assignee)

Comment 46

8 months ago
Linux workers on moonshot in mdc2 run generic-worker
Attachment #9011451 - Flags: review?(pmoore)
Hi Brian, for https://treeherder.mozilla.org/#/jobs?repo=try&revision=92b41482318129ee085e523151cb7eb0ca35ad66 
What talos jobs are needed here? If there is a lower number of reruns, would those also do the jobs?
Flags: needinfo?(bgrinstead)
(In reply to Andreea Pavel [:apavel] from comment #47)
> Hi Brian, for
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=92b41482318129ee085e523151cb7eb0ca35ad66 
> What talos jobs are needed here?

I'm debugging some regressions on a patch that changes frontend chrome code in toolkit, so there's not one  particular test. If there was a set of suites that we could use to target that type of change I'd be happy to use them. A documented whitelist of talos suites that cover 'browser chrome UI changes' would be great. 

> If there is a lower number of reruns, would
> those also do the jobs?

I've been getting some pretty strange results in the last few days with my usual number of rebuilds (6), so I've been using more.
Flags: needinfo?(bgrinstead)
(Assignee)

Comment 49

8 months ago
Posted file GitHub Pull Request
Attachment #9011451 - Attachment is obsolete: true
Attachment #9011451 - Flags: review?(pmoore)
Attachment #9013615 - Flags: review?(pmoore)
(Assignee)

Updated

8 months ago
Attachment #9009919 - Flags: checked-in+
(Assignee)

Updated

8 months ago
Attachment #9013615 - Flags: review?(dhouse)
(Assignee)

Updated

8 months ago
Attachment #9013615 - Flags: review?(jwatkins)

Updated

8 months ago
Attachment #9013615 - Flags: review?(dhouse) → review+
Comment on attachment 9013615 [details] [review]
GitHub Pull Request

See PR
Attachment #9013615 - Flags: review?(jwatkins) → review-
(Assignee)

Updated

8 months ago
Attachment #9013615 - Flags: checked-in+
(Assignee)

Comment 52

8 months ago
make gecko patch to use generic-worker format payload using gecko-t-linux-talos worker type, and test via try push
(Assignee)

Comment 53

7 months ago
New push on try, to test generic-worker on gecko-t-linux-talos queue: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6cb3938c3368ba53847b3d90ce441f5fb2a80cf0
(Reporter)

Updated

7 months ago
Attachment #9013615 - Flags: review?(pmoore)
(Reporter)

Comment 55

7 months ago
(In reply to Dragos Crisan [:dragrom] from comment #54)
> sent another push on try:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=6ced8aec5d5b72bc47c6090faf92ad89cfa7b586

Looks great!

The failures are docker worker tasks, not generic-worker tasks, so I think this should be good to go.

++
(Assignee)

Comment 56

7 months ago
Generate generic-worker payload for Linux workers
Attachment #9007217 - Attachment is obsolete: true
(Assignee)

Updated

7 months ago
Attachment #9015480 - Flags: review?(pmoore)
(Assignee)

Comment 57

7 months ago
Run generic-worker tasks on gecko-t-linux-talos queue
Attachment #9015481 - Flags: review?(pmoore)
(Assignee)

Comment 58

7 months ago
Posted file GitHub Pull Request
Move mdc2 nodes back from the gecko-t-linux-talos-tw queue to the gecko-t-linux-talos queue and run generic-worker
Attachment #9015510 - Flags: review?(dhouse)
(Reporter)

Comment 59

7 months ago
Comment on attachment 9015480 [details] [diff] [review]
run_talos_jobs_on_generic-worker.patch

Review of attachment 9015480 [details] [diff] [review]:
-----------------------------------------------------------------

::: taskcluster/scripts/tester/test-linux.sh
@@ +27,5 @@
>  : NEED_WINDOW_MANAGER           ${NEED_WINDOW_MANAGER:=false}
>  : NEED_PULSEAUDIO               ${NEED_PULSEAUDIO:=false}
>  : START_VNC                     ${START_VNC:=false}
>  : TASKCLUSTER_INTERACTIVE       ${TASKCLUSTER_INTERACTIVE:=false}
> +: WORKSPACE                     ${WORKSPACE:=workspace}

We need to use an absolute directory here, i.e.:

${WORKSPACE:=$(pwd)/workspace}

Otherwise, when we cd on line 36, the relative paths will break...

Alternatively we can allow a relative path, but then on line 37 we should reset WORKSPACE with e.g.

WORKSPACE=$(pwd)

That is probably my preferred solution so if WORKSPACE is overridden by a user calling this script outside of taskcluster, their relative path should still work.

@@ +32,5 @@
>  : mozharness args               "${@}"
>  
>  set -v
>  mkdir -p $WORKSPACE
>  cd $WORKSPACE

Since we set a relative path in $WORKSPACE on line 31, this cd will break paths such as those on lines 60 and 62.

I think supporting relative paths is reasonable, so my preferred solution is NOT to change line 31, but to inject a line after the `cd $WORKSPACE` line that resets WORKSPACE, i.e.:

WORKSPACE=$(pwd)

This allows relative WORKSPACE paths to be passed into this script without problems.
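
Concretely, the suggested change would look something like this (a sketch of the idea, not the final patch; all names come from test-linux.sh itself):

  : WORKSPACE                     ${WORKSPACE:=workspace}   # may be relative, or overridden by the caller
  mkdir -p "$WORKSPACE"
  cd "$WORKSPACE"
  WORKSPACE=$(pwd)   # from here on, $WORKSPACE is always an absolute path

  # later absolute uses keep working, e.g.:
  mkdir -p "$WORKSPACE/build/upload/logs"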

@@ +62,1 @@
>  mkdir -p $WORKSPACE/build/blobber_upload_dir

The above three mkdir commands should either all be relative, or all absolute.

Probably simplest to make them all relative (i.e. remove `$WORKSPACE/` from lines 60 and 62).

@@ +191,5 @@
>  
>  # Run a custom mach command (this is typically used by action tasks to run
>  # harnesses in a particular way)
> +if [ -n "$CUSTOM_MACH_COMMAND" ]; then
> +    eval "workspace/build/tests/mach ${CUSTOM_MACH_COMMAND}"

This should be:

eval "build/tests/mach ${CUSTOM_MACH_COMMAND}"
Attachment #9015480 - Flags: review?(pmoore) → review-
(Reporter)

Comment 60

7 months ago
Comment on attachment 9015480 [details] [diff] [review]
run_talos_jobs_on_generic-worker.patch

Review of attachment 9015480 [details] [diff] [review]:
-----------------------------------------------------------------

::: taskcluster/scripts/tester/test-linux.sh
@@ +27,5 @@
>  : NEED_WINDOW_MANAGER           ${NEED_WINDOW_MANAGER:=false}
>  : NEED_PULSEAUDIO               ${NEED_PULSEAUDIO:=false}
>  : START_VNC                     ${START_VNC:=false}
>  : TASKCLUSTER_INTERACTIVE       ${TASKCLUSTER_INTERACTIVE:=false}
> +: WORKSPACE                     ${WORKSPACE:=workspace}

Sorry, I thought I had deleted the above comment ^^^ - it is superseded by the comment for line 36. :-)
(Reporter)

Comment 61

7 months ago
Comment on attachment 9015481 [details] [diff] [review]
run_generic-worker_tasks_on_gecko-t-linux-talos.patch

Review of attachment 9015481 [details] [diff] [review]:
-----------------------------------------------------------------

Perfect, many thanks!
Attachment #9015481 - Flags: review?(pmoore) → review+

Updated

7 months ago
Attachment #9015510 - Flags: review?(dhouse) → review+
(Reporter)

Comment 62

7 months ago
Hi Dragos,

On reflection, I think this might be the best approach:


diff --git a/taskcluster/scripts/tester/test-linux.sh b/taskcluster/scripts/tester/test-linux.sh
--- a/taskcluster/scripts/tester/test-linux.sh
+++ b/taskcluster/scripts/tester/test-linux.sh
@@ -26,16 +26,17 @@ fi
 : NEED_XVFB                     ${NEED_XVFB:=true}
 : NEED_WINDOW_MANAGER           ${NEED_WINDOW_MANAGER:=false}
 : NEED_PULSEAUDIO               ${NEED_PULSEAUDIO:=false}
 : START_VNC                     ${START_VNC:=false}
 : TASKCLUSTER_INTERACTIVE       ${TASKCLUSTER_INTERACTIVE:=false}
-: WORKSPACE                     ${WORKSPACE:=$HOME/workspace}
+: TASK_DIRECTORY                ${TASK_DIRECTORY:=$(pwd)}
+: WORKSPACE                     ${WORKSPACE:=${TASK_DIRECTORY}/workspace}
 : mozharness args               "${@}"
 
 set -v
-mkdir -p $WORKSPACE
-cd $WORKSPACE
+mkdir -p "$WORKSPACE"
+cd "$WORKSPACE"
 
 fail() {
     echo # make sure error message is on a new line
     echo "[test-linux.sh:error]" "${@}"
     exit 1
@@ -55,19 +56,19 @@ fi
 
 if [[ -z ${MOZHARNESS_SCRIPT} ]]; then fail "MOZHARNESS_SCRIPT is not set"; fi
 if [[ -z ${MOZHARNESS_CONFIG} ]]; then fail "MOZHARNESS_CONFIG is not set"; fi
 
 # make sure artifact directories exist
-mkdir -p $WORKSPACE/build/upload/logs
-mkdir -p ~/artifacts/public
-mkdir -p $WORKSPACE/build/blobber_upload_dir
+mkdir -p "$WORKSPACE/build/upload/logs"
+mkdir -p "$TASK_DIRECTORY/artifacts/public"
+mkdir -p "$WORKSPACE/build/blobber_upload_dir"
 
 cleanup() {
     local rv=$?
     if [[ -s $HOME/.xsession-errors ]]; then
       # To share X issues
-      cp $HOME/.xsession-errors ~/artifacts/public/xsession-errors.log
+      cp "$HOME/.xsession-errors" "$TASK_DIRECTORY/artifacts/public/xsession-errors.log"
     fi
     if $NEED_XVFB; then
         cleanup_xvfb
     fi
     exit $rv
@@ -122,11 +123,11 @@ if $NEED_XVFB; then
     . $HOME/scripts/xvfb.sh
     start_xvfb '1600x1200x24' 0
 fi
 
 if $START_VNC; then
-    x11vnc > ~/artifacts/public/x11vnc.log 2>&1 &
+    x11vnc > "$TASK_DIRECTORY/artifacts/public/x11vnc.log" 2>&1 &
 fi
 
 if $NEED_WINDOW_MANAGER; then
     # This is read by xsession to select the window manager
     echo DESKTOP_SESSION=ubuntu > $HOME/.xsessionrc
@@ -174,11 +175,11 @@ mkdir -p $(dirname $mozharness_bin)
 # Save the computed mozharness command to a binary which is useful
 # for interactive mode.
 echo -e "#!/usr/bin/env bash
 # Some mozharness scripts assume base_work_dir is in
 # the current working directory, see bug 1279237
-cd $WORKSPACE
+cd "$WORKSPACE"
 cmd=\"python2.7 ${MOZHARNESS_PATH}/scripts/${MOZHARNESS_SCRIPT} ${config_cmds} ${@} \${@}\"
 echo \"Running: \${cmd}\"
 exec \${cmd}" > ${mozharness_bin}
 chmod +x ${mozharness_bin}
 
@@ -190,8 +191,8 @@ if ! $TASKCLUSTER_INTERACTIVE; then
 fi
 
 # Run a custom mach command (this is typically used by action tasks to run
 # harnesses in a particular way)
 if [ "$CUSTOM_MACH_COMMAND" ]; then
-    eval "$HOME/workspace/build/tests/mach ${CUSTOM_MACH_COMMAND}"
+    eval "'$WORKSPACE/build/tests/mach' ${CUSTOM_MACH_COMMAND}"
     exit $?
 fi
(Assignee)

Comment 64

7 months ago
Run talos jobs on generic worker
Attachment #9015480 - Attachment is obsolete: true
Attachment #9015557 - Flags: review?(pmoore)
Attachment #9015557 - Flags: review?(dustin)
(Reporter)

Comment 65

7 months ago
Comment on attachment 9015557 [details] [diff] [review]
run_talos_jobs_on_generic-worker.patch

Review of attachment 9015557 [details] [diff] [review]:
-----------------------------------------------------------------

That looks perfect now, many thanks! :)
Attachment #9015557 - Flags: review?(pmoore) → review+
Comment on attachment 9015557 [details] [diff] [review]
run_talos_jobs_on_generic-worker.patch

Review of attachment 9015557 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good!  It took me a bit to work through the refactoring of the existing (native-engine and docker-worker) implementations in run-task.py, but they seem ok (aside from not using workdir).  Just a few things to fix up, then I can look again.

::: taskcluster/taskgraph/transforms/job/mozharness_test.py
@@ +260,5 @@
> +            'c:\\mozilla-build\\python\\python.exe',
> +            '-u',
> +            'mozharness\\scripts\\' + normpath(mozharness['script'])
> +        ]
> +    else:

Add a comment here to help readers
  # is_linux or is_macosx

::: taskcluster/taskgraph/transforms/job/run_task.py
@@ +39,5 @@
>      Required('workdir'): basestring,
>  })
>  
>  
> +def common_setup(config, job, taskdesc, command, checkoutdir='checkouts'):

This default for checkoutdir is not used in the calls to common_setup that I see, and seems incorrect anyway.  Should it be omitted?

@@ +71,5 @@
>  def docker_worker_run_task(config, job, taskdesc):
>      run = job['run']
>      worker = taskdesc['worker'] = job['worker']
> +    command = ['/builds/worker/bin/run-task']
> +    common_setup(config, job, taskdesc, command, checkoutdir='/builds/worker/checkouts')

Why do these no longer use `{workdir}`?  More generally, I'm not sure what the purpose was of factoring out add_checkout_to_command.

@@ +102,4 @@
>  
>      worker['context'] = '{}/raw-file/{}/taskcluster/scripts/run-task'.format(
>          config.params['head_repository'], config.params['head_rev']
>      )

This should use the newly-factored run_task_url, too.
Attachment #9015557 - Flags: review?(dustin) → feedback+
(Assignee)

Comment 67

7 months ago
Attachment #9015557 - Attachment is obsolete: true
Attachment #9016253 - Flags: review?(pmoore)
(Assignee)

Updated

7 months ago
Attachment #9016253 - Flags: review?(pmoore) → review?(dustin)
Comment on attachment 9016253 [details] [diff] [review]
run_talos_jobs_on_generic-worker.patch

Review of attachment 9016253 [details] [diff] [review]:
-----------------------------------------------------------------

Awesome, thanks for fixing those up!
Attachment #9016253 - Flags: review?(dustin) → review+
(Assignee)

Updated

7 months ago
Attachment #9015510 - Flags: checked-in+

Comment 69

7 months ago
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2c6af6cc1ae8
provide support for linux talos tasks on generic-worker without enabling,r=pmoore,r=dustin
https://hg.mozilla.org/integration/mozilla-inbound/rev/f3e71a64a33c
move linux talos tasks from taskcluster-worker on gecko-t-linux-talos-tw to generic-worker on gecko-t-linux-talos,r=pmoore
(Reporter)

Comment 70

7 months ago
Nice work Dragos! \o/
(Reporter)

Updated

7 months ago
Keywords: leave-open
(Reporter)

Comment 72

7 months ago
Failure in docker-worker jobs, e.g. from https://taskcluster-artifacts.net/VSlC8VMmQoGxRAeWvFPRNw/0/public/logs/live_backing.log:

[task 2018-10-12T10:37:30.511Z] executing ['/builds/worker/bin/test-linux.sh', '--installer-url=https://queue.taskcluster.net/v1/task/DeAj5UnNRyigjNpgB-w5iQ/artifacts/public/build/target.apk', '--test-packages-url=https://queue.taskcluster.net/v1/task/DeAj5UnNRyigjNpgB-w5iQ/artifacts/public/build/target.test_packages.json', '--test-suite=crashtest', '--total-chunk=10', '--this-chunk=2', '--download-symbols=true']
[task 2018-10-12T10:37:30.517Z] + set -x -e
[task 2018-10-12T10:37:30.517Z] ++ id
[task 2018-10-12T10:37:30.522Z] + echo 'running as' 'uid=1000(worker)' 'gid=1000(worker)' 'groups=1000(worker),44(video)'
[task 2018-10-12T10:37:30.522Z] running as uid=1000(worker) gid=1000(worker) groups=1000(worker),44(video)
[task 2018-10-12T10:37:30.522Z] + . /etc/lsb-release
[task 2018-10-12T10:37:30.522Z] ++ DISTRIB_ID=Ubuntu
[task 2018-10-12T10:37:30.522Z] ++ DISTRIB_RELEASE=16.04
[task 2018-10-12T10:37:30.522Z] ++ DISTRIB_CODENAME=xenial
[task 2018-10-12T10:37:30.522Z] ++ DISTRIB_DESCRIPTION='Ubuntu 16.04.5 LTS'
[task 2018-10-12T10:37:30.522Z] + '[' 16.04 == 12.04 ']'
[task 2018-10-12T10:37:30.522Z] + '[' 16.04 == 16.04 ']'
[task 2018-10-12T10:37:30.522Z] + UBUNTU_1604=1
[task 2018-10-12T10:37:30.522Z] + : MOZHARNESS_PATH
[task 2018-10-12T10:37:30.522Z] + : MOZHARNESS_URL https://queue.taskcluster.net/v1/task/DeAj5UnNRyigjNpgB-w5iQ/artifacts/public/build/mozharness.zip
[task 2018-10-12T10:37:30.522Z] + : MOZHARNESS_SCRIPT android_emulator_unittest.py
[task 2018-10-12T10:37:30.522Z] + : MOZHARNESS_CONFIG android/android_common.py android/androidarm_4_3.py
[task 2018-10-12T10:37:30.522Z] + : NEED_XVFB true
[task 2018-10-12T10:37:30.522Z] + : NEED_WINDOW_MANAGER true
[task 2018-10-12T10:37:30.522Z] + : NEED_PULSEAUDIO true
[task 2018-10-12T10:37:30.522Z] + : START_VNC false
[task 2018-10-12T10:37:30.522Z] + : TASKCLUSTER_INTERACTIVE false
[task 2018-10-12T10:37:30.523Z] ++ pwd
[task 2018-10-12T10:37:30.523Z] + : TASK_DIRECTORY /
[task 2018-10-12T10:37:30.523Z] + : WORKSPACE //workspace
[task 2018-10-12T10:37:30.523Z] + : mozharness args --installer-url=https://queue.taskcluster.net/v1/task/DeAj5UnNRyigjNpgB-w5iQ/artifacts/public/build/target.apk --test-packages-url=https://queue.taskcluster.net/v1/task/DeAj5UnNRyigjNpgB-w5iQ/artifacts/public/build/target.test_packages.json --test-suite=crashtest --total-chunk=10 --this-chunk=2 --download-symbols=true
[task 2018-10-12T10:37:30.523Z] + set -v
[task 2018-10-12T10:37:30.523Z] mkdir -p "$WORKSPACE"
[task 2018-10-12T10:37:30.523Z] + mkdir -p //workspace
[task 2018-10-12T10:37:30.526Z] mkdir: cannot create directory ‘//workspace’: Permission denied
(Reporter)

Comment 73

7 months ago
This is the same task passing without the patches from this bug:

[task 2018-10-12T10:00:30.501Z] executing ['/builds/worker/bin/test-linux.sh', '--installer-url=https://queue.taskcluster.net/v1/task/SJhdl9vpT5q58pE2YaRoCQ/artifacts/public/build/target.apk', '--test-packages-url=https://queue.taskcluster.net/v1/task/SJhdl9vpT5q58pE2YaRoCQ/artifacts/public/build/target.test_packages.json', '--test-suite=crashtest', '--total-chunk=10', '--this-chunk=2', '--download-symbols=true']
[task 2018-10-12T10:00:30.504Z] + set -x -e
[task 2018-10-12T10:00:30.504Z] ++ id
[task 2018-10-12T10:00:30.506Z] + echo 'running as' 'uid=1000(worker)' 'gid=1000(worker)' 'groups=1000(worker),44(video)'
[task 2018-10-12T10:00:30.506Z] running as uid=1000(worker) gid=1000(worker) groups=1000(worker),44(video)
[task 2018-10-12T10:00:30.506Z] + . /etc/lsb-release
[task 2018-10-12T10:00:30.506Z] ++ DISTRIB_ID=Ubuntu
[task 2018-10-12T10:00:30.506Z] ++ DISTRIB_RELEASE=16.04
[task 2018-10-12T10:00:30.506Z] ++ DISTRIB_CODENAME=xenial
[task 2018-10-12T10:00:30.506Z] ++ DISTRIB_DESCRIPTION='Ubuntu 16.04.5 LTS'
[task 2018-10-12T10:00:30.506Z] + '[' 16.04 == 12.04 ']'
[task 2018-10-12T10:00:30.506Z] + '[' 16.04 == 16.04 ']'
[task 2018-10-12T10:00:30.506Z] + UBUNTU_1604=1
[task 2018-10-12T10:00:30.506Z] + : MOZHARNESS_PATH
[task 2018-10-12T10:00:30.506Z] + : MOZHARNESS_URL https://queue.taskcluster.net/v1/task/SJhdl9vpT5q58pE2YaRoCQ/artifacts/public/build/mozharness.zip
[task 2018-10-12T10:00:30.507Z] + : MOZHARNESS_SCRIPT android_emulator_unittest.py
[task 2018-10-12T10:00:30.507Z] + : MOZHARNESS_CONFIG android/android_common.py android/androidarm_4_3.py
[task 2018-10-12T10:00:30.507Z] + : NEED_XVFB true
[task 2018-10-12T10:00:30.507Z] + : NEED_WINDOW_MANAGER true
[task 2018-10-12T10:00:30.507Z] + : NEED_PULSEAUDIO true
[task 2018-10-12T10:00:30.507Z] + : START_VNC false
[task 2018-10-12T10:00:30.507Z] + : TASKCLUSTER_INTERACTIVE false
[task 2018-10-12T10:00:30.507Z] + : WORKSPACE /builds/worker/workspace
[task 2018-10-12T10:00:30.507Z] + : mozharness args --installer-url=https://queue.taskcluster.net/v1/task/SJhdl9vpT5q58pE2YaRoCQ/artifacts/public/build/target.apk --test-packages-url=https://queue.taskcluster.net/v1/task/SJhdl9vpT5q58pE2YaRoCQ/artifacts/public/build/target.test_packages.json --test-suite=crashtest --total-chunk=10 --this-chunk=2 --download-symbols=true
[task 2018-10-12T10:00:30.507Z] + set -v
[task 2018-10-12T10:00:30.507Z] mkdir -p $WORKSPACE
[task 2018-10-12T10:00:30.507Z] + mkdir -p /builds/worker/workspace


So we see the problem is that before on docker worker we had:

  : WORKSPACE                     ${WORKSPACE:=$HOME/workspace}
                                  => /builds/worker/workspace

but with the patches becomes:

  : TASK_DIRECTORY                ${TASK_DIRECTORY:=$(pwd)}
                                  => /
  : WORKSPACE                     ${WORKSPACE:=${TASK_DIRECTORY}/workspace}
                                  => //workspace


My patch in comment 62 assumed that the working directory was the home directory on docker-worker, which it isn't. The working directory is the root folder (/). So this is my fault.

We need a solution that when test-linux.sh runs in docker-worker, it builds the workspace under the home directory, but when running on generic-worker, it builds the workspace under the current directory.

I would propose we leave the code as it is (with the default behaviour of building the workspace under the current directory) but pass in $WORKSPACE as /builds/worker/workspace via environment variables when generating docker-worker task payloads.
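
In other words, a sketch of the intended behaviour (the docker-worker value below is the old default from before this change):

  # generic-worker: tasks start in their task directory, so the new defaults work:
  : TASK_DIRECTORY                ${TASK_DIRECTORY:=$(pwd)}                   # the task directory
  : WORKSPACE                     ${WORKSPACE:=${TASK_DIRECTORY}/workspace}

  # docker-worker: pwd is "/", so the default would resolve to "//workspace" and
  # the mkdir fails; instead the docker-worker task payload would export
  #   WORKSPACE=/builds/worker/workspace
  # so the default above is never used there.
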
(Reporter)

Comment 74

7 months ago
Also, something else looks wrong:

Here is an example linux talos task running under taskcluster-worker from before the mozilla-inbound push:

https://tools.taskcluster.net/groups/ZT19XamFR-aZfYrpH2esFw/tasks/J0OS8mYITUW6ok__-Yb8Ww/details

Here we see test-linux64-qr/opt-talos-svgr-e10s has the following artifacts:


  "artifacts": [
    {
      "expires": "2019-10-12T09:36:15.966Z",
      "name": "public/logs",
      "path": "workspace/build/upload/logs",
      "type": "directory"
    },
    {
      "expires": "2019-10-12T09:36:15.966Z",
      "name": "public/test",
      "path": "artifacts",
      "type": "directory"
    },
    {
      "expires": "2019-10-12T09:36:15.966Z",
      "name": "public/test_info",
      "path": "workspace/build/blobber_upload_dir",
      "type": "directory"
    }
  ],


After the mozilla-inbound push to run it under generic-worker, it now has:

  "artifacts": [
    {
      "name": "public/logs",
      "path": "logs",
      "type": "directory"
    },
    {
      "name": "public/test_info",
      "path": "build/blobber_upload_dir",
      "type": "directory"
    }
  ]


So for some reason, "public/test" artifact has disappeared in the process.
(Reporter)

Comment 76

7 months ago
(also the paths are wrong)
(Reporter)

Comment 77

7 months ago
Also, there seems to be something strange here. Compare these "test-linux64-qr/opt-talos-svgr-e10s" task definitions:

taskcluster-worker:

  https://queue.taskcluster.net/v1/task/TFs3I7zyTlGM7H7TNVLL8Q

generic-worker:

  https://queue.taskcluster.net/v1/task/J0OS8mYITUW6ok__-Yb8Ww


Not only are the artifacts different, but the commands too...

taskcluster-worker:

    "command": [
      "./test-linux.sh",
      "--installer-url=https://queue.taskcluster.net/v1/task/DhPd6kB2TPqaHlEDWKqFGQ/artifacts/public/build/target.tar.bz2",
      "--test-packages-url=https://queue.taskcluster.net/v1/task/DhPd6kB2TPqaHlEDWKqFGQ/artifacts/public/build/target.test_packages.json",
      "--suite=svgr-e10s",
      "--use-talos-json",
      "--enable-webrender",
      "--download-symbols=ondemand"
    ]

generic-worker:

    "command": [
      [
        "python2.7",
        "-u",
        "mozharness/scripts/talos_script.py",
        "--cfg",
        "mozharness/configs/talos/linux_config.py",
        "--suite=svgr-e10s",
        "--use-talos-json",
        "--enable-webrender",
        "--installer-url",
        "https://queue.taskcluster.net/v1/task/S1QCYQQMTEWWZIspXeyKTQ/artifacts/public/build/target.tar.bz2",
        "--test-packages-url",
        "https://queue.taskcluster.net/v1/task/S1QCYQQMTEWWZIspXeyKTQ/artifacts/public/build/target.test_packages.json",
        "--download-symbols",
        "ondemand",
        "--suite=svgr-e10s",
        "--use-talos-json",
        "--enable-webrender"
      ]
    ]

Dragos, do you know why generic-worker uses mozharness, but taskcluster-worker doesn't?
Flags: needinfo?(dcrisan)
(Reporter)

Comment 78

7 months ago
Hey Joel, comparing the taskcluster-worker versus generic-worker commands in comment 77, do you know if either are ok to use, and if not, which is the correct/preferred way to run the test suite?

For the purposes of the migration I would certainly prefer that generic-worker mimicked the taskcluster-worker implementation, but if it turns out that running via mozharness is preferred, maybe it is ok. I'm pretty certain we will want the taskcluster-worker style implementation, but just checking with you just in case.
Flags: needinfo?(jmaher)
(Reporter)

Comment 79

7 months ago
... and I've just spotted that several of the arguments are duplicated, e.g. "--enable-webrender" is listed twice, as is "--suite=svgr-e10s" - so something is clearly not quite right.. :(
I don't know the preferred way, just whatever works to be honest.

I see test-linux.sh:
https://searchfox.org/mozilla-central/source/taskcluster/scripts/tester/test-linux.sh

that seems important and I suspect that is what is currently run to setup the machine and then it calls talos_script.py.  Most of our automated tests have some duplicated args in the command line- this is usually a result of transforms and yaml configs duplicating work.
Flags: needinfo?(jmaher)
(Reporter)

Comment 81

7 months ago
Note, we should also change this line:

  : WORKSPACE                     ${WORKSPACE:=${TASK_DIRECTORY}/workspace}

to:

  : WORKSPACE                     ${WORKSPACE:=${TASK_DIRECTORY%/}/workspace}

This will remove a trailing slash if it exists, before adding a slash. This is important especially if TASK_DIRECTORY == '/'.
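
For example (the second value is just an arbitrary absolute path for illustration):

  TASK_DIRECTORY=/
  echo "${TASK_DIRECTORY}/workspace"      # -> //workspace
  echo "${TASK_DIRECTORY%/}/workspace"    # -> /workspace

  TASK_DIRECTORY=/home/cltbld
  echo "${TASK_DIRECTORY%/}/workspace"    # -> /home/cltbld/workspace (unchanged behaviour)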

Then we should rename TASK_DIRECTORY to something like CURRENT_DIRECTORY, since this script doesn't need to run as a task; a user may wish to call it directly, outside taskcluster and outside the context of a task.
(Assignee)

Updated

7 months ago
Flags: needinfo?(dcrisan)
(Assignee)

Comment 83

7 months ago
Run linux talos jobs using generic-worker
Attachment #9016253 - Attachment is obsolete: true
Attachment #9017866 - Flags: review?(pmoore)
(Reporter)

Comment 84

7 months ago
Comment on attachment 9017866 [details] [diff] [review]
run_talos_jobs_on_generic-worker.patch

Review of attachment 9017866 [details] [diff] [review]:
-----------------------------------------------------------------

Looks perfect, many thanks!
Attachment #9017866 - Flags: review?(pmoore) → review+

Comment 85

7 months ago
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f7525008369c
migrate talos linux tasks to run using generic-worker and use worker type gecko-t-linux-talos,r=pmoore
(Assignee)

Updated

7 months ago
Attachment #9015481 - Flags: checked-in+
(Assignee)

Updated

7 months ago
Attachment #9017866 - Flags: checked-in+

Updated

7 months ago
See Also: → 1501250
(Assignee)

Updated

7 months ago
Whiteboard: [checkin-needed-beta]
This updated some of our Talos baselines:

== Change summary for alert #17070 (as of Tue, 23 Oct 2018 06:16:13 GMT) ==

Improvements:

  7%  sessionrestore linux64 pgo e10s stylo                     351.67 -> 325.33
  6%  sessionrestore_no_auto_restore linux64 pgo e10s stylo     367.08 -> 344.75

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=17070

Comment 88

7 months ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-beta/rev/d96335858456
Whiteboard: [checkin-needed-beta]
(Assignee)

Updated

7 months ago
Attachment #9019965 - Flags: checked-in+
(Assignee)

Comment 90

7 months ago
Moved 60 workers from the gecko-t-linux-talos-tw queue to the gecko-t-linux-talos queue, to run generic-worker. We now have the following worker counts and ranges:

In gecko-t-linux-talos-tw queue:
Workers range:
 t-linux64-ms-181 - 195 - 15 workers
 t-linux64-ms-226 - 239 - 14 workers
 t-linux64-ms-271 - 279 - 9 workers
TOTAL = 38 workers

In gecko-t-linux-talos queue:
Workers range:
 t-linux64-ms-001 - 015 - 15 workers
 t-linux64-ms-046 - 060 - 15 workers
 t-linux64-ms-091 - 105 - 15 workers
 t-linux64-ms-136 - 150 - 15 workers
 t-linux64-ms-301 - 315 - 15 workers
 t-linux64-ms-346 - 360 - 15 workers
 t-linux64-ms-391 - 405 - 15 workers
 t-linux64-ms-436 - 450 - 15 workers
 t-linux64-ms-481 - 495 - 15 workers
 t-linux64-ms-526 - 540 - 15 workers
TOTAL = 150 workers

All workers from gecko-t-linux-talos queue run generic-worker
The workers from gecko-t-linux-talos-tw queue run taskcluster-worker
https://hg.mozilla.org/projects/larch/rev/2c6af6cc1ae8bd7d8227f377fb0d8e255d41b918
Bug 1474570 - provide support for linux talos tasks on generic-worker without enabling,r=pmoore,r=dustin

https://hg.mozilla.org/projects/larch/rev/f3e71a64a33cf74135362b6d278b2acb6600454e
Bug 1474570 - move linux talos tasks from taskcluster-worker on gecko-t-linux-talos-tw to generic-worker on gecko-t-linux-talos,r=pmoore

https://hg.mozilla.org/projects/larch/rev/f7525008369c91098d956634894662ffa185de22
Bug 1474570 - migrate talos linux tasks to run using generic-worker and use worker type gecko-t-linux-talos,r=pmoore
(Assignee)

Updated

7 months ago
Attachment #9022837 - Flags: checked-in+
(Assignee)

Comment 93

7 months ago
All Linux workers are now in the gecko-t-linux-talos queue
(Assignee)

Comment 94

6 months ago
All Linux workers are now in the gecko-t-linux-talos queue, and all workers in that queue run generic-worker
Status: REOPENED → RESOLVED
Last Resolved: 8 months ago → 6 months ago
Resolution: --- → FIXED
Blocks: 1545368