Closed Bug 1369537 Opened 4 years ago Closed 4 years ago

adjust talos configs so we can run talos on windows + linux via taskcluster

Categories

(Testing :: Talos, enhancement)

enhancement
Not set
normal

Tracking

(firefox58 fixed)

RESOLVED FIXED
mozilla58

People

(Reporter: jmaher, Assigned: rwood)

References

Details

(Whiteboard: [PI:October])

Attachments

(2 files)

in the world where we run taskcluster natively there will need to be some changes to talos, specifically on windows.

In this try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4f340dd78273387a31ca8cc880a951147f29bd31&selectedJob=103816119

I made a series of changes:
https://hg.mozilla.org/try/rev/e02cb46d80bc4825705fde74e2c0192646bae0a7

this works to get green results for us, but the new mitmproxy (q1) job fails.

We should work on making these official configs so the work in the future to stand up talos for windows (either via BBB or natively on TC) will be quicker.
bug 1372324 will have an example of running talos on linux64
Depends on: 1372324
Whiteboard: [PI:June] → [PI:August]
Assignee: nobody → ionut.goldan
Whiteboard: [PI:August] → [PI:September]
here is a push that I just did which has the configs changed to run on vm:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a3f27724577dc18a7110f472815e2af6681b9512

Ideally we could:
1) take the .py changes and make them permanent if possible
2) remove hardware from talos configs and make a transform, to reduce the work needed to shift and the chance of random errors
3) any other changes that you see fit :)
adding in linux to the scope to make sure we can run the tests in a linux VM- again, not looking for tests passing, just running.
Summary: adjust talos configs so we can run talos on windows via taskcluster → adjust talos configs so we can run talos on windows + linux via taskcluster
one hiccup in my try push above is that tp6 fails ( https://public-artifacts.taskcluster.net/Mm-mMpmIT26cWFH_tYe_BQ/0/public/logs/live_backing.log ):
19:07:00    ERROR - Return code: -1073741515
19:07:00     INFO - Getting output from command: ['Z:\\task_1505153366\\build\\python3.6\\python', '--version']
19:07:00     INFO - Copy/paste: Z:\task_1505153366\build\python3.6\python --version
19:07:00     INFO - Using _rmtree_windows ...
19:07:00     INFO - Running command: del /F /Q "Z:\task_1505153366\tmpfile_stderr"
19:07:00     INFO - Return code: 0
19:07:00     INFO - Using _rmtree_windows ...
19:07:00     INFO - Running command: del /F /Q "Z:\task_1505153366\tmpfile_stdout"
19:07:00     INFO - Return code: 0
19:07:00    ERROR - Return code: -1073741515
19:07:00     INFO - Running post-action listener: _resource_record_post_action
19:07:00     INFO - [mozharness: 2017-09-11 19:07:00.055000Z] Finished setup-mitmproxy step (failed)
19:07:00    FATAL - Uncaught exception: Traceback (most recent call last):
19:07:00    FATAL -   File "Z:\task_1505153366\mozharness\mozharness\base\script.py", line 2059, in run
19:07:00    FATAL -     self.run_action(action)
19:07:00    FATAL -   File "Z:\task_1505153366\mozharness\mozharness\base\script.py", line 1998, in run_action
19:07:00    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
19:07:00    FATAL -   File "Z:\task_1505153366\mozharness\mozharness\base\script.py", line 1938, in _possibly_run_method
19:07:00    FATAL -     return getattr(self, method_name)()
19:07:00    FATAL -   File "Z:\task_1505153366\mozharness\mozharness\mozilla\testing\talos.py", line 436, in setup_mitmproxy
19:07:00    FATAL -     self.setup_py3_virtualenv()
19:07:00    FATAL -   File "Z:\task_1505153366\mozharness\mozharness\mozilla\testing\talos.py", line 453, in setup_py3_virtualenv
19:07:00    FATAL -     self.py3_venv_configuration(python_path=self.py3_path, venv_path='py3venv')
19:07:00    FATAL -   File "Z:\task_1505153366\mozharness\mozharness\base\python.py", line 815, in py3_venv_configuration
19:07:00    FATAL -     [self.py3_python_path, '--version'], env=self.query_env()).split()[-1]
19:07:00    FATAL - AttributeError: 'NoneType' object has no attribute 'split'
19:07:00    FATAL - Running post_fatal callback...
19:07:00    FATAL - Exiting -1
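
The `AttributeError` at the end of this traceback is a secondary symptom: the `python --version` call produced no output, so the value was `None` when `.split()` was called on it. A minimal sketch of a guarded version query follows, using only the standard library. `query_python_version` is a hypothetical helper for illustration, not the mozharness API.

```python
import subprocess
import sys


def query_python_version(python_path):
    """Return a version string such as '3.6.1', or None if the
    interpreter cannot be started or prints nothing.

    Hypothetical helper; mozharness uses its own command wrappers.
    """
    try:
        proc = subprocess.run(
            [python_path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError:
        # The executable is missing or could not be launched at all.
        return None
    if proc.returncode != 0:
        # The process started but failed (for example, a loader error).
        return None
    # Older interpreters print the version on stderr, newer ones on stdout.
    output = (proc.stdout or proc.stderr).strip()
    if not output:
        return None
    return output.split()[-1]


if __name__ == "__main__":
    print(query_python_version(sys.executable))
```

Returning `None` explicitly lets the caller report "interpreter failed to start" instead of failing later with an unrelated-looking attribute error.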


------

I do wonder if we are getting python3.6 downloaded:
19:06:53     INFO - proxxy config: {'regions': ['.use1.', '.usw2.'], 'instances': ['proxxy1.srv.releng.use1.mozilla.com', 'proxxy1.srv.releng.usw2.mozilla.com'], 'urls': [('http://ftp.mozilla.org', 'ftp.mozilla.org'), ('https://ftp.mozilla.org', 'ftp.mozilla.org'), ('https://ftp-ssl.mozilla.org', 'ftp.mozilla.org'), ('http://pypi.pvt.build.mozilla.org', 'pypi.pvt.build.mozilla.org'), ('http://pypi.pub.build.mozilla.org', 'pypi.pub.build.mozilla.org')]}
19:06:53     INFO - retry: Calling run_command with args: (['c:\\mozilla-build\\python\\python.exe', 'C:\\mozilla-build\\tooltool.py', '--url', 'https://tooltool.mozilla-releng.net/', '--authentication-file', 'c:\\builds\\relengapi.tok', 'fetch', '-m', 'Z:\\task_1505153366\\build\\tests\\talos\\talos\\mitmproxy\\python3.manifest', '-o', '-c', 'c:\\build\\tooltool_cache'],), kwargs: {'output_timeout': 600, 'error_list': [{'substr': 'command not found', 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x015516E0>, 'level': 'warning'}, {'substr': 'Traceback (most recent call last)', 'level': 'error'}, {'substr': 'SyntaxError: ', 'level': 'error'}, {'substr': 'TypeError: ', 'level': 'error'}, {'substr': 'NameError: ', 'level': 'error'}, {'substr': 'ZeroDivisionError: ', 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x01535740>, 'level': 'critical'}, {'regex': <_sre.SRE_Pattern object at 0x01AF2240>, 'level': 'critical'}, {'substr': 'ERROR - ', 'level': 'error'}], 'cwd': 'Z:\\task_1505153366\\build', 'privileged': False}, attempt #1
19:06:53     INFO - Running command: ['c:\\mozilla-build\\python\\python.exe', 'C:\\mozilla-build\\tooltool.py', '--url', 'https://tooltool.mozilla-releng.net/', '--authentication-file', 'c:\\builds\\relengapi.tok', 'fetch', '-m', 'Z:\\task_1505153366\\build\\tests\\talos\\talos\\mitmproxy\\python3.manifest', '-o', '-c', 'c:\\build\\tooltool_cache'] in Z:\task_1505153366\build
19:06:53     INFO - Copy/paste: c:\mozilla-build\python\python.exe C:\mozilla-build\tooltool.py --url https://tooltool.mozilla-releng.net/ --authentication-file c:\builds\relengapi.tok fetch -m Z:\task_1505153366\build\tests\talos\talos\mitmproxy\python3.manifest -o -c c:\build\tooltool_cache
19:06:53     INFO - Calling ['c:\\mozilla-build\\python\\python.exe', 'C:\\mozilla-build\\tooltool.py', '--url', 'https://tooltool.mozilla-releng.net/', '--authentication-file', 'c:\\builds\\relengapi.tok', 'fetch', '-m', 'Z:\\task_1505153366\\build\\tests\\talos\\talos\\mitmproxy\\python3.manifest', '-o', '-c', 'c:\\build\\tooltool_cache'] with output_timeout 600
19:06:53     INFO -  INFO - File python3.6.zip not present in local cache folder c:\build\tooltool_cache
19:06:53     INFO -  INFO - Attempting to fetch from 'https://tooltool.mozilla-releng.net/'...
19:06:54     INFO -  INFO - File python3.6.zip fetched from https://tooltool.mozilla-releng.net/ as Z:\task_1505153366\build\tmpok_nt7
19:06:54     INFO -  INFO - File integrity verified, renaming tmpok_nt7 to python3.6.zip
19:06:54     INFO -  INFO - Updating local cache c:\build\tooltool_cache...
19:06:54     INFO -  INFO - Local cache c:\build\tooltool_cache updated with python3.6.zip
19:06:54     INFO -  INFO - unzipping "python3.6.zip"
19:07:00     INFO - Return code: 0


I see the error from tooltool.py; possibly this is a clue, or a red herring.  Looking at the g5 job on the same push ( https://public-artifacts.taskcluster.net/ax5w3Sh9SLWXUQ_30qs5nQ/0/public/logs/live_backing.log ), I see the same error when tooltool.py downloads tp5n.zip:
19:05:55     INFO - proxxy config: {'regions': ['.use1.', '.usw2.'], 'instances': ['proxxy1.srv.releng.use1.mozilla.com', 'proxxy1.srv.releng.usw2.mozilla.com'], 'urls': [('http://ftp.mozilla.org', 'ftp.mozilla.org'), ('https://ftp.mozilla.org', 'ftp.mozilla.org'), ('https://ftp-ssl.mozilla.org', 'ftp.mozilla.org'), ('http://pypi.pvt.build.mozilla.org', 'pypi.pvt.build.mozilla.org'), ('http://pypi.pub.build.mozilla.org', 'pypi.pub.build.mozilla.org')]}
19:05:55     INFO - retry: Calling run_command with args: (['c:\\mozilla-build\\python\\python.exe', 'C:\\mozilla-build\\tooltool.py', '--url', 'https://tooltool.mozilla-releng.net/', '--authentication-file', 'c:\\builds\\relengapi.tok', 'fetch', '-m', 'Z:\\task_1505155940\\build\\tests\\talos\\tp5n-pageset.manifest', '-o', '-c', 'c:\\build\\tooltool_cache'],), kwargs: {'output_timeout': 600, 'error_list': [{'substr': 'command not found', 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x015716E0>, 'level': 'warning'}, {'substr': 'Traceback (most recent call last)', 'level': 'error'}, {'substr': 'SyntaxError: ', 'level': 'error'}, {'substr': 'TypeError: ', 'level': 'error'}, {'substr': 'NameError: ', 'level': 'error'}, {'substr': 'ZeroDivisionError: ', 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x01555740>, 'level': 'critical'}, {'regex': <_sre.SRE_Pattern object at 0x01C10240>, 'level': 'critical'}, {'substr': 'ERROR - ', 'level': 'error'}], 'cwd': 'Z:\\task_1505155940\\build\\tests\\talos\\talos\\tests', 'privileged': False}, attempt #1
19:05:55     INFO - Running command: ['c:\\mozilla-build\\python\\python.exe', 'C:\\mozilla-build\\tooltool.py', '--url', 'https://tooltool.mozilla-releng.net/', '--authentication-file', 'c:\\builds\\relengapi.tok', 'fetch', '-m', 'Z:\\task_1505155940\\build\\tests\\talos\\tp5n-pageset.manifest', '-o', '-c', 'c:\\build\\tooltool_cache'] in Z:\task_1505155940\build\tests\talos\talos\tests
19:05:55     INFO - Copy/paste: c:\mozilla-build\python\python.exe C:\mozilla-build\tooltool.py --url https://tooltool.mozilla-releng.net/ --authentication-file c:\builds\relengapi.tok fetch -m Z:\task_1505155940\build\tests\talos\tp5n-pageset.manifest -o -c c:\build\tooltool_cache
19:05:55     INFO - Calling ['c:\\mozilla-build\\python\\python.exe', 'C:\\mozilla-build\\tooltool.py', '--url', 'https://tooltool.mozilla-releng.net/', '--authentication-file', 'c:\\builds\\relengapi.tok', 'fetch', '-m', 'Z:\\task_1505155940\\build\\tests\\talos\\tp5n-pageset.manifest', '-o', '-c', 'c:\\build\\tooltool_cache'] with output_timeout 600
19:05:55     INFO -  INFO - File tp5n.zip not present in local cache folder c:\build\tooltool_cache
19:05:55     INFO -  INFO - Attempting to fetch from 'https://tooltool.mozilla-releng.net/'...
19:06:02     INFO -  INFO - File tp5n.zip fetched from https://tooltool.mozilla-releng.net/ as Z:\task_1505155940\build\tests\talos\talos\tests\tmpriqung
19:06:02     INFO -  INFO - File integrity verified, renaming tmpriqung to tp5n.zip
19:06:02     INFO -  INFO - Updating local cache c:\build\tooltool_cache...
19:06:03     INFO -  INFO - Local cache c:\build\tooltool_cache updated with tp5n.zip
19:06:03     INFO - Return code: 0
19:06:03     INFO - Running command: ['unzip', '-q', '-o', u'Z:\\task_1505155940\\build\\tests\\talos\\talos\\tests\\tp5n.zip', '-d', 'Z:\\task_1505155940\\build\\tests\\talos\\talos\\tests']
19:06:03     INFO - Copy/paste: unzip -q -o Z:\task_1505155940\build\tests\talos\talos\tests\tp5n.zip -d Z:\task_1505155940\build\tests\talos\talos\tests
19:06:53     INFO - Return code: 0


except this time it runs an unzip command. Looking at a passing tp6, I don't see any copy/paste lines for unzip, which leads me to believe we are downloading and unzipping python3.6.zip properly. So where is the failure?

I expect to see this:
03:32:03     INFO - Copy/paste: C:\slave\test\build\python3.6\python --version
03:32:03     INFO - Reading from file tmpfile_stdout
03:32:03     INFO - Output received:
03:32:03     INFO -  Python 3.6.1

but we execute:
Z:\task_1505153366\build\python3.6\python --version

could it be that placing an executable in z:\ is problematic (most likely not as firefox executes from Z:\task_1505153366\build\application\firefox\firefox.exe)?  We execute python from c:\\mozilla-build\\python\\python.exe on taskcluster, so possibly there is some issue there either with the environment variables or other packaging.


possibly getting a loaner would be a good next step:
https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance
:igoldan, is this a bug you can work on this week and next week?  If not, I want to get this in my queue or :rwood's queue for work.
Flags: needinfo?(ionut.goldan)
I'm afraid not, as I already have 2 tasks I'm looking over.
Flags: needinfo?(ionut.goldan)
:rwood, do you think you will have time to look into this in the next week or so?
Assignee: ionut.goldan → nobody
Flags: needinfo?(rwood)
Will do!
Assignee: nobody → rwood
Status: NEW → ASSIGNED
Flags: needinfo?(rwood)
First step, I imported your patch locally and landed on try, and reproduced the same tp6 failure:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=61241c7d01641148aeda8922dbb2ac405bd5b14f&selectedJob=132276204
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #4)
...
> could it be that placing an executable in z:\ is problematic (most likely
> not as firefox executes from
> Z:\task_1505153366\build\application\firefox\firefox.exe)?  We execute
> python from c:\\mozilla-build\\python\\python.exe on taskcluster, so
> possibly there is some issue there either with the environment variables or
> other packaging.

I don't believe it's a path issue; the paths where we set up python 3.6 (Z:) are the same in this VM setup as they are on the existing Win 10 'tp6' job, which is green.

I'm still investigating; env vars look the same so far. At this point I'm *guessing* it's an issue with the Python 3.6 tooltool package that we are installing for the mitmproxy virtualenv; perhaps our package version of Python 3.6 works on Win 7 hardware but for some reason fails on a Win 7 VM...
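
Comparing env vars between a passing job and a failing job can be done mechanically rather than by eye. A minimal sketch, assuming both environments have been captured as dictionaries (for example from `os.environ` on each machine); `diff_env` and the sample values are hypothetical.

```python
def diff_env(passing, failing):
    """Compare two environment dictionaries.

    Returns (only_in_passing, only_in_failing, changed), where `changed`
    maps a variable name to its (passing, failing) values.
    Names are upper-cased first because Windows treats them as
    case-insensitive.
    """
    a = {k.upper(): v for k, v in passing.items()}
    b = {k.upper(): v for k, v in failing.items()}
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = {
        k: (a[k], b[k])
        for k in sorted(set(a) & set(b))
        if a[k] != b[k]
    }
    return only_a, only_b, changed


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    win10 = {"Path": r"C:\mozilla-build\python", "TEMP": r"Z:\tmp"}
    win7 = {"PATH": r"C:\mozilla-build\python", "TMP": r"Z:\tmp"}
    print(diff_env(win10, win7))
```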
Update: I was able to get this to work on the win 7 vm. Instead of using our python35 package from tooltool and creating a special virtualenv, I have it install mitmproxy from a release binary [2], which doesn't need python 3.5+, the same way we currently do for osx. That worked: tp6 ran fine and passed in the win 7 vm. That's the good news. The bad news is that the same mitmproxy release binary didn't work on our current win 10 hw setup in this same try push [1]; not sure why, perhaps it doesn't support x64.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=eee94231813efa0f9b67891ac87e76a86e42bdef&selectedJob=132550366
[2] https://github.com/mitmproxy/mitmproxy/releases/
possibly a loaner and some hacking would help get to the answer.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #12)
> possibly a loaner and some hacking would help get to the answer.

Will we be replacing win 10 with a vm also? Assuming so... in which case yep I'll need a win 10 vm loaner, and this will take longer than I was hoping unfortunately.
as it stands right now- yes win10 will have a lightweight VM layer.  I think just using:
https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance

No guarantee that will work, but a good first try.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #14)
...
Ok, thanks!
Update: I have a win 10 vm loaner (using the link in comment 14) and am trying to get a talos/dev env up and running on it.
Depends on: 1404450
Whiteboard: [PI:September] → [PI:October]
Update: where it stands, tp6 as-is (using our python 3.x tooltool package for a 2nd virtual env) works on the win 10 vm but fails on the win 7 vm.

Made some progress with a tc win7 (vm) loaner. Turns out a "Python.exe - system error" dialog is popping up on the win 7 loaner when our tooltool python 3.x executable is invoked, with this error: "The program can't start because api-ms-win-crt-runtime-l1-1-0.dll is missing from your computer. Try reinstalling the program to fix this"
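
The return code -1073741515 logged earlier for `python --version` is consistent with this dialog. Reinterpreted as an unsigned 32-bit value it is 0xC0000135, the Windows status code STATUS_DLL_NOT_FOUND. A minimal sketch of that conversion follows; the lookup table is deliberately tiny and the function names are hypothetical.

```python
def to_ntstatus(return_code):
    """Reinterpret a (possibly negative) 32-bit return code as unsigned."""
    return return_code & 0xFFFFFFFF


# Small, incomplete lookup table for illustration.
KNOWN_STATUS = {
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe(return_code):
    """Format a return code as hex plus a symbolic name when known."""
    status = to_ntstatus(return_code)
    name = KNOWN_STATUS.get(status, "unknown")
    return "0x{:08X} ({})".format(status, name)


if __name__ == "__main__":
    print(describe(-1073741515))  # 0xC0000135 (STATUS_DLL_NOT_FOUND)
```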
Awesome - installing the "Update for universal C runtime in windows" [1] on the win 7 vm fixes it. Ran the installer at [2] and now the tp6 suite runs fine on the win 7 vm. I will follow up with this in Bug 1400365.

[1] https://support.microsoft.com/en-us/help/2999226/update-for-universal-c-runtime-in-windows
[2] https://www.microsoft.com/en-us/download/details.aspx?id=49077
Hi :pmoore, :grenade, would you be able to add the 'windows universal C runtime' (package in comment 18) to the existing tc win 7 vms please? This will enable talos tp6 to run there. This is only an issue on the win 7 vm, the win 10 vms already work fine with tp6. Thanks!
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
attempted to install the kb (https://github.com/mozilla-releng/OpenCloudConfig/commit/dbb87a6a10637d8ac805e172e236a5d92b032c79) using the powershell dsc mechanism but for some reason it hangs at the wusa call and subsequently the ami regeneration task times out. i suspect the problem is with the silent installer missing a flag to tell it to run in non interactive mode or similar, so it's awaiting user input that it never gets.
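
For reference, `wusa.exe` accepts `/quiet` and `/norestart` switches for unattended installs. The sketch below only builds such a command line; whether these switches resolve the hang described here is not confirmed in this bug. The `.msu` file name is a placeholder assumption and `build_wusa_command` is a hypothetical helper.

```python
def build_wusa_command(msu_path, quiet=True, norestart=True):
    """Build an argument list for installing a .msu update with wusa.exe.

    /quiet suppresses user interaction; /norestart prevents an automatic
    reboot when combined with /quiet.
    """
    cmd = ["wusa.exe", msu_path]
    if quiet:
        cmd.append("/quiet")
    if norestart:
        cmd.append("/norestart")
    return cmd


if __name__ == "__main__":
    # Placeholder path; substitute the real location of the update package.
    print(build_wusa_command(r"C:\updates\KB2999226.msu"))
```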

we can work around this for now by installing the kb directly on the base ami (or hardware via mdt) if we know which worker types will need it. my guess is gecko-t-win7-32-hw. is that correct? or is it all win 7 instances?
Flags: needinfo?(rthijssen)
we are trying to prepare our tests to run on the new HW/VM machines coming online soon- tp6 has been failing and this is the root cause.  We would like to ensure that the talos tests can run on VMs successfully so future development or needs to measure code coverage are met.

For this case, we need the KB installed on the win7-vm instances.
understood - sort of...

to me "vm" means the ec2 workertypes already online: gecko-t-win7-32 and gecko-t-win7-32-gpu (if these talos tests require a gpu). if that's the case, it's just base amis that need to be updated with KB2999226.

if we're talking about the new hardware (not yet online), then it's gecko-t-win7-32-hw which means an mdt hack rather than base amis.

if we'd rather just have the kb installed everywhere (all tc win 7 instances), then it's both mdt and base amis that need updating.
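
The three options above come down to which image source has to change for each worker type. A minimal sketch of that mapping, using only the worker type names mentioned in this comment; the dictionary and function names are hypothetical.

```python
# Which image source must be updated for each worker type,
# as described in this comment.
IMAGE_SOURCE = {
    "gecko-t-win7-32": "base AMI",
    "gecko-t-win7-32-gpu": "base AMI",
    "gecko-t-win7-32-hw": "MDT",
}


def update_targets(worker_types):
    """Group the requested worker types by the image source to update."""
    targets = {}
    for worker_type in worker_types:
        source = IMAGE_SOURCE[worker_type]
        targets.setdefault(source, []).append(worker_type)
    return targets


if __name__ == "__main__":
    # "Install everywhere" means both base AMIs and MDT need updating.
    print(update_targets(sorted(IMAGE_SOURCE)))
```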

please clarify what's wanted/needed and i'll get started on the hacking.
we should have this on all VMs: ec2 (the existing ones, gecko-t-win7-32 and gecko-t-win7-32-gpu) + the new MDT machines.
Flags: needinfo?(pmoore)
(In reply to Robert Wood [:rwood] from comment #29)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=069452c0b36c729b5ddb7396ef66f0c555ce8983

Small patch that helps prepare talos for running on tc win vm. Will make it a bit faster when we're ready to switch over.
Comment on attachment 8923557 [details]
Bug 1369537 - preparing for talos on tc win vm;

https://reviewboard.mozilla.org/r/194674/#review199702

awesome- it would be nice to post a patch needed to get tests running- i.e. what remains after this lands.
Attachment #8923557 - Flags: review?(jmaher) → review+
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #32)
...
> awesome- it would be nice to post a patch needed to get tests running- i.e.
> what remains after this lands.

Thanks, good idea - here's what will be left to switch talos to the tc win vm.
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9b7557bb37c2
preparing for talos on tc win vm; r=jmaher
https://hg.mozilla.org/mozilla-central/rev/9b7557bb37c2
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58