Closed Bug 1482344 (Opened: Last year; Closed: Last year)

raptor fails to run fetch benchmarks after moving to hardware

Categories: Testing :: Raptor (enhancement)
Status: RESOLVED FIXED
Tracking: firefox63 --- fixed
Target Milestone: mozilla63
People: Reporter: jmaher; Assigned: ahal
Attachments: (2 files)

The raptor benchmarks unity3d and wasm-misc work fine when they are run on virtual machines, but they fail when run on physical hardware.  Looking at logs from before and after the move, we used to fetch the task and place the data in the benchmarks directory; now only benchmarks from third_party/webkit/PerformanceTests/ appear in our benchmark directory while running tests.
on a virtual machine, I see this in the log:
[taskcluster 2018-08-09 13:48:23.520Z] === Task Starting ===
[setup 2018-08-09T13:48:23.986Z] run-task started in /builds/worker
[cache 2018-08-09T13:48:23.989Z] cache /builds/worker/checkouts exists; requirements: gid=1000 uid=1000 version=1
[cache 2018-08-09T13:48:23.989Z] cache /builds/worker/workspace exists; requirements: gid=1000 uid=1000 version=1
[volume 2018-08-09T13:48:23.990Z] changing ownership of volume /builds/worker/.cache to 1000:1000
[volume 2018-08-09T13:48:23.990Z] volume /builds/worker/checkouts is a cache
[volume 2018-08-09T13:48:23.990Z] changing ownership of volume /builds/worker/tooltool-cache to 1000:1000
[volume 2018-08-09T13:48:23.990Z] volume /builds/worker/workspace is a cache
[setup 2018-08-09T13:48:23.991Z] running as worker:worker
[fetches 2018-08-09T13:48:23.991Z] fetching artifacts
Downloading https://queue.taskcluster.net/v1/task/XGuKvVIKTqi2FDJc_lWG-w/artifacts/public/wasm-misc.zip to /builds/worker/fetches/wasm-misc.zip.tmp
Downloading https://queue.taskcluster.net/v1/task/XGuKvVIKTqi2FDJc_lWG-w/artifacts/public/wasm-misc.zip
https://queue.taskcluster.net/v1/task/XGuKvVIKTqi2FDJc_lWG-w/artifacts/public/wasm-misc.zip resolved to 4433793 bytes with sha256 0ba273b748b872117a4b230c776bbd73550398da164025a735c28a16c0224397 in 0.619s
Renaming to /builds/worker/fetches/wasm-misc.zip
Extracting /builds/worker/fetches/wasm-misc.zip to /builds/worker/fetches using ['unzip', '/builds/worker/fetches/wasm-misc.zip']
Archive:  /builds/worker/fetches/wasm-misc.zip
   creating: wasm-misc/
...
/builds/worker/fetches/wasm-misc.zip extracted in 0.136s
Removing /builds/worker/fetches/wasm-misc.zip
[fetches 2018-08-09T13:48:24.867Z] finished fetching artifacts
[task 2018-08-09T13:48:24.867Z] executing ['/builds/worker/bin/test-linux.sh', '--installer-url=https://queue.taskcluster.net/v1/task/cMNgzfDCRJSd6A9blGVoBw/artifacts/public/build/target.tar.bz2', '--test-packages-url=https://queue.taskcluster.net/v1/task/cMNgzfDCRJSd6A9blGVoBw/artifacts/public/build/target.test_packages.json', '--test=raptor-wasm-misc', '--branch-name', 'try', '--download-symbols=ondemand']


On hardware we don't run test-linux.sh; is it possible that we have different features in docker-worker vs the <whatever>-worker that we are using on hardware?
Flags: needinfo?(wcosta)
Flags: needinfo?(ahal)
I see :ahal recently added fetch_artifacts support in run-task:
https://searchfox.org/mozilla-central/source/taskcluster/scripts/run-task#742

This looks as if it is supported in both docker-worker and native-engine, but I found that native-engine (i.e. hardware) doesn't have MOZ_FETCHES defined in its environment variables.
It appears that an edit+retrigger to add the MOZ_FETCH* env vars doesn't solve the problem either.
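For illustration, the missing piece can be sketched like this (the MOZ_FETCHES JSON shape and the helper names are assumptions inferred from the queue URLs in the log above, not the actual run-task code):

```python
# Sketch: consume a MOZ_FETCHES-style env var (assumed to be a JSON list of
# {"artifact", "task", ...} entries) and build the queue artifact URLs.
import json

QUEUE = "https://queue.taskcluster.net/v1"

def artifact_url(task_id, artifact_name):
    # Matches the shape of the queue URLs seen in the virtual-machine log.
    return "{}/task/{}/artifacts/{}".format(QUEUE, task_id, artifact_name)

def parse_fetches(env):
    # Return the fetch entries from MOZ_FETCHES, or [] when the var is unset --
    # which is what was observed on the native-engine (hardware) workers.
    raw = env.get("MOZ_FETCHES")
    return json.loads(raw) if raw else []

env = {"MOZ_FETCHES": json.dumps(
    [{"artifact": "public/wasm-misc.zip", "task": "XGuKvVIKTqi2FDJc_lWG-w"}])}
for f in parse_fetches(env):
    print(artifact_url(f["task"], f["artifact"]))
```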
(In reply to Joel Maher ( :jmaher ) (UTC+2) from comment #1)
> on a virtual machine, I see this in the log:
> [log snipped; see comment #1]
> on hardware we don't run test-linux.sh, is it possible that we have
> different features in docker-worker vs <whatever>-worker that we are using
> on hardware?

It doesn't seem to be related to worker setup. Do you have a link to the failing task?
Flags: needinfo?(wcosta)
I noticed that on packet, it searches for the home directory at /home/cltbld; shouldn't it be /builds/worker?
Flags: needinfo?(jmaher)
I think it's the other way around: those native-engine workers run from /home/cltbld. Joel, I think you need to add the 'workdir' key to raptor.yml, similar to what I needed for the jsshell-bench tasks:
https://searchfox.org/mozilla-central/source/taskcluster/ci/source-test/jsshell.yml#19

Note those jsshell tasks are currently the only things using both run-task and a native-engine worker, so there are still edge cases that haven't been smoothed over.
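Based on the jsshell.yml pattern linked above, the raptor change might look roughly like this (the surrounding keys are assumptions; only the `workdir` key is the point):

```yaml
# taskcluster/ci/test/raptor.yml (sketch -- surrounding structure is assumed)
run:
    workdir: /home/cltbld  # native-engine workers run from here, not /builds/worker
```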
Flags: needinfo?(ahal)
Oh, but because raptor.yml is a "test" kind (and not a "source-test" kind like jsshell), you'll need to figure out how to propagate this value from raptor.yml up to here:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/__init__.py#199

There may also very well be other problems. These are the first "test" tasks to use native-engine + fetches.
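A minimal sketch of the kind of propagation needed (the function and key names here are hypothetical; the real logic lives in the job transform linked above):

```python
# Hypothetical helper: prefer a per-task run.workdir (e.g. set in raptor.yml),
# falling back to the docker-worker default used by the "test" kind.
def resolve_workdir(task, default="/builds/worker"):
    return task.get("run", {}).get("workdir") or default

# A native-engine raptor task would set it explicitly:
print(resolve_workdir({"run": {"workdir": "/home/cltbld"}}))
print(resolve_workdir({"run": {}}))  # docker-worker default
```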
(In reply to Andrew Halberstadt [:ahal] from comment #7)
> I think it's the other way around, those native-engine workers run from
> /home/cltbld.

Oops, my bad; I am so biased toward packet.net that I assumed the task was running there.
Flags: needinfo?(jmaher)
This is happening because the 'native-engine' implementation in mozharness_test.py is overwriting the worker's env instead of updating it:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/mozharness_test.py#340

Though the workdir also needed to be set as per comment 7.
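The difference between overwriting and updating can be shown with a minimal sketch (the dict shapes are illustrative, not the actual transform code):

```python
# Overwriting worker['env'] drops keys set by earlier transforms, including
# MOZ_FETCHES; merging with update() preserves them.
worker = {"env": {"MOZ_FETCHES": "[...]"}}  # set earlier in the task graph

def set_env_buggy(worker, env):
    worker["env"] = env  # clobbers MOZ_FETCHES

def set_env_fixed(worker, env):
    worker.setdefault("env", {}).update(env)  # merges instead

set_env_fixed(worker, {"MOZHARNESS_SCRIPT": "raptor_script.py"})
print(sorted(worker["env"]))  # both keys survive
```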
Assignee: nobody → ahal
Status: NEW → ASSIGNED
Turns out that still wasn't enough because the native-engine workers don't use `run-task` (mozharness_test.py could use some TLC), which means MOZ_FETCHES aren't downloaded automatically.

There are two options:
A) Try to mount the run-task and fetch-content scripts on these workers and modify mozharness_test.py to always use run-task.
B) Download the fetches in mozharness (there is precedent here from the code-coverage tasks)

Option A is more aligned with the future we want to see, so I'll give that a brief shot. If I can't get it to work for any reason, I'll fall back to option B.
We need to grab fetches from several places in mozharness, so this creates a
dedicated mixin that can be used from anywhere. If the 'fetch-content' script
is detected, it will be used; otherwise we download the fetches manually.
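A rough sketch of the strategy selection described above (the class and method names are assumptions, not the actual mixin in the attached patch):

```python
import os

class FetchesMixin(object):
    """Sketch: decide how to download MOZ_FETCHES artifacts."""

    def fetch_strategy(self, fetch_content="/builds/worker/bin/fetch-content"):
        # Prefer the fetch-content script when the worker provides it;
        # otherwise fall back to downloading the artifacts manually.
        if os.path.exists(fetch_content):
            return "fetch-content"
        return "manual"
```

Either way the same set of artifacts ends up on disk; detection just lets docker-worker tasks keep using the script they already mount.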
This unbreaks some tier 3 raptor tasks. There are a few fixes rolled together here:
1) Stop overwriting the 'env' in mozharness_test.py's 'native-engine' implementation
2) Set the workdir to /home/cltbld (which makes sure the fetches are downloaded there)
3) Download the fetches via mozharness in the 'raptor' script (since they don't use run-task anymore)

Depends on D3651
Comment on attachment 9002065 [details]
Bug 1482344 - [raptor] Fix fetch tasks for native-engine mozharness_test based tasks, r=jmaher

Joel Maher ( :jmaher ) (UTC+2) has approved the revision.
Attachment #9002065 - Flags: review+
Comment on attachment 9002064 [details]
Bug 1482344 - [mozharness] Refactor codecoverage fetch downloading into a standalone mixin, r=marco

Tudor-Gabriel Vijiala [:tvijiala] has approved the revision.
Attachment #9002064 - Flags: review+
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/95e338482796
[mozharness] Refactor codecoverage fetch downloading into a standalone mixin, r=tvijiala
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/aa6f46eaec1b
[raptor] Fix fetch tasks for native-engine mozharness_test based tasks, r=jmaher
https://hg.mozilla.org/mozilla-central/rev/95e338482796
https://hg.mozilla.org/mozilla-central/rev/aa6f46eaec1b
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla63