Closed Bug 1237706 Opened 8 years ago Closed 8 years ago

Refactor download_unzip() for easier usage in mozharness

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect
Not set
normal

Tracking

(firefox45 fixed, firefox46 fixed, firefox-esr45 fixed)

RESOLVED FIXED
Tracking Status
firefox45 --- fixed
firefox46 --- fixed
firefox-esr45 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

(Depends on 1 open bug)

Details

Attachments

(4 files, 3 obsolete files)

I don't know why mozharness requires the external unzip command e.g. for extracting various test archives. Python has the ZipFile class which can be used for this purpose and would drop one external requirement on the box the scripts runs on.

So why is it used? Does anything speak against using the ZipFile class?

Not sure who can actually make the decision. Jordan, if you can't whom shall I needinfo?
Flags: needinfo?(jlund)
(In reply to Henrik Skupin (:whimboo) from comment #0)
> I don't know why mozharness requires the external unzip command e.g. for
> extracting various test archives. Python has the ZipFile class which can be
> used for this purpose and would drop one external requirement on the box the
> scripts runs on.
> 
> So why is it used? Does anything speak against using the ZipFile class?
> 
> Not sure who can actually make the decision. Jordan, if you can't whom shall
> I needinfo?

When aki started Mozharness, he was still learning python and particularly the libraries. I wouldn't say it was a conscious decision to not use ZipFile. This would be a pretty big change since so many jobs use unzip.

we've had issues before with unzip. mainly older versions builtin on platforms like osx[1][2]. So switching to ZipFile _may_ help with that. However, it also shows how fragile this is. I'm more than support making the change but we would want complete build and test coverage in try before rolling this out.

Thanks for trying to improve mh state here :)

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1066700#c3
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1032391
Flags: needinfo?(jlund)
That sounds complex and I might not have the time to do all of this. So I talked with Jordan on IRC and proposed him the following solution:

1. We add an option to allow the usage of ZipFile without changing the default which will still be the external unzip command.
2. I make use of ZipFile for Firefox UI Tests to qualify it across platforms
3. We can enable other code to use ZipFile step by step
4. Finally we can remove the requirement for unzip.

Jordan agreed on this proposal. So I will check next week what I can do.
So some code in mozbuild is using the JarWriter class of mozpack to create the zip files. This happens e.g. for the test packages. But those packages are not getting created optimized. So it should not be a problem for ZipFile to read the file. Not sure if there is any publicly uploaded file which actually got created in optimized mode. The only file I'm aware of where we are doing that is the omni.ja file inside of the Firefox folder. But files like those we are never going to extract with mozharness.

Using ZipFile instead of unzip might give us some performance loss in extracting the archives. I might wanna do some performance tests just to get a feeling about which rate we talk about here. Something else we might not need to care about are archives >4GB given that we might never reach such sizes.
The attachment shows a comparison between the performance of unzip and the internal ZipFile class on my Linux box with a SDD attached. The differences vary a lot and are about 1:1 up to 1:3.

I would suggest we keep the internal ZipFile class optional and use it in case no unzip exe file has been set via the config file.
Summary: mozharness should use Python's ZipFile instead of the external unzip command → mozharness should use Python's ZipFile in case unzip has not been set in the config file
Checking all the files which use unzip reveals a lot of duplicated code. Especially in Andoid land. Something great here would be to have a centralized method, which can be used to extract certain archive types. My first impression was to add this to the ScriptMixin class. While checking this file I found the already existent `unpack()` method:

https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#1355

I think the unzip and ZipFile code would perfectly fit into this method.

I will take this bug!
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
This patch moves the `_download_unzip()` method from testbase to scripts given that it is more widely used then only for testing code. Also I have extended various parts of it so it should fit for all current calls (which I also have slightly adjusted).

Further we make use of this method now for downloading and extracting test urls, test package urls, and symbol urls.

All those changes should be non-invasive and should not cause any regression. For safety I pushed a try build:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=27469e21e9a1

Once this has been landed I will continue with the `unpack()` method.
Attachment #8707123 - Flags: review?(jlund)
Related to this, creating and appending files on Windows / NTFS is slow. Using simple multi-threaded I/O from Python, I'm able to best 7-zip.exe in terms of extract performance. And that's in spite of the Python GIL, which is notorious for killing multi-threaded Python performance. We should definitely write a unified extract() Python API that does the optimal thing depending on the platform and available utilities.
(In reply to Gregory Szorc [:gps] from comment #7)
> Using simple multi-threaded I/O from Python, I'm able to best 7-zip.exe in
> terms of extract performance. And that's in spite of the Python GIL, which
> is notorious for killing multi-threaded Python performance. We should
> definitely write a unified extract() Python API that does the optimal thing
> depending on the platform and available utilities.

Given that we have lots of IO in this case and that under those situations the GIL is not locked for the other thread (as how I read this now on various sites) it would make sense. So do you mean that in case no system tool is available to extract a given archive we should fall back to the internal classes like ZipFile and run the extraction via multiple threads? So extractall() vs. extract() with a divided list of files? 

If you have specific web pages which give an insight in those details I would kinda be interested in.
Comment on attachment 8707123 [details] [diff] [review]
Refactor download_unzip and move to scripts.py (v1)

Review of attachment 8707123 [details] [diff] [review]:
-----------------------------------------------------------------

looks great! if try is happy, so am I. thanks for this great cleanup.

::: testing/mozharness/mozharness/base/script.py
@@ +468,5 @@
> +
> +        """
> +        dirs = self.query_abs_dirs()
> +        zipfile = self.download_file(url, parent_dir=dirs['abs_work_dir'],
> +                                     error_level=FATAL)

iiuc - this will make download_unzip's halt_on_failure param ignored

::: testing/mozharness/mozharness/mozilla/testing/firefox_ui_tests.py
@@ +139,5 @@
>  
> +        # base_path = '/mozilla/code/gecko/testing/mozharness'
> +        # self.unpack(os.path.join(base_path, 'a', 'info.tar.gz'), os.path.join(base_path, 'b'))
> +        # self.unpack(os.path.join(base_path, 'a', 'info.tar.bz2'), os.path.join(base_path, 'b'))
> +        # self.unpack(os.path.join(base_path, 'a', 'info.zip'), os.path.join(base_path, 'b'))

debug lines?

::: testing/mozharness/mozharness/mozilla/testing/testbase.py
@@ -389,5 @@
> -        # create a proxxy object and get the binaries from it
> -        source = self.download_file(self.test_url, file_name=file_name,
> -                                    parent_dir=dirs['abs_work_dir'],
> -                                    error_level=FATAL)
> -        self.test_zip_path = os.path.realpath(source)

sanity check, I notice you don't save this, self.test_zip_path, anymore. Is it needed anywhere?
Attachment #8707123 - Flags: review?(jlund) → review+
Comment on attachment 8707123 [details] [diff] [review]
Refactor download_unzip and move to scripts.py (v1)

Review of attachment 8707123 [details] [diff] [review]:
-----------------------------------------------------------------

Thank you for the review! And yes, the try run was successful. No single failure was detected. I will fix the nit with the debug code. Otherwise see my replies inline. I don't think we need updates for any of those.

::: testing/mozharness/mozharness/base/script.py
@@ +468,5 @@
> +
> +        """
> +        dirs = self.query_abs_dirs()
> +        zipfile = self.download_file(url, parent_dir=dirs['abs_work_dir'],
> +                                     error_level=FATAL)

`download_file()` does not have such a parameter. I think we only support that for external processes.

::: testing/mozharness/mozharness/mozilla/testing/firefox_ui_tests.py
@@ +139,5 @@
>  
> +        # base_path = '/mozilla/code/gecko/testing/mozharness'
> +        # self.unpack(os.path.join(base_path, 'a', 'info.tar.gz'), os.path.join(base_path, 'b'))
> +        # self.unpack(os.path.join(base_path, 'a', 'info.tar.bz2'), os.path.join(base_path, 'b'))
> +        # self.unpack(os.path.join(base_path, 'a', 'info.zip'), os.path.join(base_path, 'b'))

Ouch! Indeed. I blame `git commit -a` here. I will remove that.

::: testing/mozharness/mozharness/mozilla/testing/testbase.py
@@ -389,5 @@
> -        # create a proxxy object and get the binaries from it
> -        source = self.download_file(self.test_url, file_name=file_name,
> -                                    parent_dir=dirs['abs_work_dir'],
> -                                    error_level=FATAL)
> -        self.test_zip_path = os.path.realpath(source)

No, there are two instances in the whole mozharness code which are here in that file. Once in that method and then in `_extract_test_zip()`. So it was only used to share the name of the downloaded file between those two methods.
Removed the debug lines, which went in accidentally. As mentioned in my last comment nothing else would be necessary here. So this is the final patch.
Attachment #8707123 - Attachment is obsolete: true
Attachment #8707590 - Flags: review+
Attachment #8707590 - Flags: checked-in+
Attached patch Enhancements for unpack() (v1) (obsolete) — Splinter Review
This patch is the next step in regards to use ZipFile soon. It adds the following enhancements:

* Renaming of download_unzip() to download_unpack() to handle all kinds of known archive types (tar.gz, tar.bz2, zip), and adjusting the parameters
* Enhancements to unpack()
** Adds zip support
** Adds support to only extract specific folders/files from a tar.gz and tar.bz2 (similar to the already existent feature for zip files)
** Removal of the verbose mode for tar files to be in sync with zip (if you think verbose is necessary we have to add another parameter to the function)
** Throws an exception now for unknown file types
** Using known error lists for tar and zip archives
* Updating various files for the above changes

I did a try run for those changes:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=58dfd597a1c0
Attachment #8707792 - Flags: review?(jlund)
Attached patch Enhancements for unpack() (v1.1) (obsolete) — Splinter Review
Updated patch now without the debugging lines in firefox-ui-tests.
Attachment #8707792 - Attachment is obsolete: true
Attachment #8707792 - Flags: review?(jlund)
Attachment #8707798 - Flags: review?(jlund)
Keywords: leave-open
While the above patch is still waiting for review I continued here and got fallback code for both tar and zip added. I pushed a changeset to try to check how timing is different for fallbacks on real releng machines.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fe35642647e4
(In reply to Henrik Skupin (:whimboo) from comment #16)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=fe35642647e4

That try run has a couple of oranges with some showing permission denied. I don't see any bug filed about this issue yet and also nothing like that appears for other try builds or on inbound. So I assume it is somewhat related to my changes.

I proposed to keep the fallback for now if the external tool is not available. The step to kill unzip can be done with another patch. In the meantime we could test everything with firefox-ui-tests on our own machines, and fix issues as they appear.
The following treeherder job shows results for a local run of firefox-ui-tests on a Windows machine which does not have unzip/tar installed:

https://treeherder.allizom.org/logviewer.html#?job_id=2705713&repo=mozilla-central
This patch adds the fallbacks for both tarfile and zipfile in case the external tools are not available. To make it work I had to remove the `halt_on_failure` parameters for the `run_command()` calls, so that we now can check the retval and in case of -1 fallback to the internal extraction method.

Given that my former try runs which explicitly forced tarfile and zipfile are somewhat broken (some tests are failing including permission failures) I won't make it the default yet. But once this patch has been landed we can check how our firefox-ui-tests behave, fix possible failures and do a complete try run with forced internal extraction. Those test suites which are passing we could then adapt, and other would need more investigation.
Attachment #8708255 - Flags: review?(jlund)
I should have said that we'll *eventually* want to write a unified extract API that does things in the most optimal way possible. https://bz.mercurial-scm.org/show_bug.cgi?id=4889 has some info on the slow file closing problem on Windows. https://www.selenic.com/pipermail/mercurial-devel/2016-January/078406.html shows the impact of implementing background file closing for writing 200,000+ files as part of `hg clone`.
Comment on attachment 8707798 [details] [diff] [review]
Enhancements for unpack() (v1.1)

Review of attachment 8707798 [details] [diff] [review]:
-----------------------------------------------------------------

looks good to me from what I can see :)

::: testing/mozharness/mozharness/base/script.py
@@ +1400,4 @@
>          m = re.search('\.tar\.(bz2|gz)$', filename)
>          if m:
> +            # tar does not automatically create the target folder
> +            if not os.path.exists(extract_to):

should we ensure extract_to is a directory?

@@ +1406,2 @@
>              command = self.query_exe('tar', return_type='list')
> +            tar_args = 'zxf' if m.group(1) == 'gz' else 'jxf'

looks like you remove verbosity. it might be nice to have it as an option. particularly for smaller archives or for debugging

@@ +1433,2 @@
>          else:
> +            raise NotImplementedError('No extraction method found for: %s' % filename)

++
Attachment #8707798 - Flags: review?(jlund) → review+
Comment on attachment 8708255 [details] [diff] [review]
Add fallback to tarfile and zipfile (v1)

Review of attachment 8708255 [details] [diff] [review]:
-----------------------------------------------------------------

I have some questions in this one. setting to r- for now. please re-request when you have addressed concerns with either comments or a new patch.

::: testing/mozharness/mozharness/base/script.py
@@ +1415,5 @@
>              if not os.path.exists(extract_to):
>                  os.makedirs(extract_to)
>  
>              command = self.query_exe('tar', return_type='list')
> +            tar_args = 'zxf' if filename.endswith('gz') else 'jxf'

hmm, did the group(1) not work?

@@ +1423,5 @@
>                  command.extend(extract_dirs)
> +            retval = self.run_command(command,
> +                                      error_level=WARNING,
> +                                      error_list=TarErrorList,
> +                                      fatal_exit_code=3,

I think fatal_exit_code can disappear too if halt_on_failure is gone.

@@ -1411,5 @@
>                  command.extend(extract_dirs)
> -            self.run_command(command,
> -                             error_list=TarErrorList,
> -                             fatal_exit_code=3,
> -                             halt_on_failure=halt_on_failure,

should we remove halt_on_failure from method sig?

@@ +1427,5 @@
> +                                      fatal_exit_code=3,
> +                                      )
> +
> +            # In case of an OS error (tar not available) fallback to TarFile
> +            if retval == -1:

looks like it is possible for retval to != -1. e.g. if run_command doesn't throw an exception but tar subprocess cmd returns a non 0.

iiuc, it may be better to check `if retval` (a none Falsey value).

@@ +1449,5 @@
> +                                      success_codes=[0, 11],
> +                                      )
> +
> +            # In case of an OS error (unzip not available) fallback to ZipFile
> +            if retval == -1:

same applies here except we should also accept 11

@@ +1457,5 @@
> +                        bundle.extract(name, path=extract_to)
> +
> +                        # ZipFile doesn't preserve permissions: http://bugs.python.org/issue15795
> +                        filename = os.path.realpath(os.path.join(extract_to, name))
> +                        mode = bundle.getinfo(name).external_attr >> 16 & 0x1FF

interesting.. have you tested this? e.g. forced this code to run?
Attachment #8708255 - Flags: review?(jlund) → review-
over irc we also talked about the requirement for having tests for these changes. motivation being this is a core change, touches everywhere, and would be beneficial for future changes going forward. But we agreed tests can come after the implementation is finished.
Comment on attachment 8708255 [details] [diff] [review]
Add fallback to tarfile and zipfile (v1)

Review of attachment 8708255 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mozharness/mozharness/base/script.py
@@ +1414,5 @@
>              # tar does not automatically create the target folder
>              if not os.path.exists(extract_to):
>                  os.makedirs(extract_to)
>  
>              command = self.query_exe('tar', return_type='list')

another thought. if we mainly care about the presence of unzip/tar, we could instead use self.which() here and, iirc, it will also check for the existance of the program. so if not found, we could fall back to python's implementation.
I don't think we should ever use the installed unzip on the Mac. The versions that ship with all versions of OS X will fail to download files more than 2 GB. unzip 6.0 fixes this, but acquiring this is difficult. We no longer have that many files that size, but our tests.zip files used to be.
(In reply to Syd Polk :sydpolk from comment #25)
> longer have that many files that size, but our tests.zip files used to be.

That's why the requirement of version 6.0 for unzip:

http://mxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/testing/testbase.py#372
Updated patch with the tar verbosity added back only. If you disagree with my statement about the isdir() check please let me know and I will rework it in my next patch.

Taking over r+ from last revision of the patch.
Attachment #8707798 - Attachment is obsolete: true
Attachment #8709100 - Flags: review+
Attachment #8709100 - Flags: checked-in+
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/c912bbffbbce for Android Talos bustage.

Push with failing jobs: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a5472dd5fafd
Talos failure example: https://treeherder.mozilla.org/logviewer.html#?job_id=19992611&repo=mozilla-inbound

09:50:37     INFO - proxxy config: {'regions': ['.use1.', '.usw2.', '.scl3'], 'instances': ['proxxy1.srv.releng.use1.mozilla.com', 'proxxy1.srv.releng.usw2.mozilla.com', 'proxxy1.srv.releng.scl3.mozilla.com'], 'urls': [('http://ftp.mozilla.org', 'ftp.mozilla.org'), ('https://ftp.mozilla.org', 'ftp.mozilla.org'), ('https://ftp-ssl.mozilla.org', 'ftp.mozilla.org'), ('http://pvtbuilds.pvt.build.mozilla.org', 'pvtbuilds.mozilla.org'), ('http://pypi.pvt.build.mozilla.org', 'pypi.pvt.build.mozilla.org'), ('http://pypi.pub.build.mozilla.org', 'pypi.pub.build.mozilla.org'), ('https://queue.taskcluster.net', 'queue.taskcluster.net')]}
09:50:37     INFO - https://queue.taskcluster.net/v1/task/aBgvYSk3S0OH0yIRGqI1jA/artifacts/public/build/fennec-46.0a1.en-US.android-arm.apk matches https://queue.taskcluster.net
09:50:37     INFO - URL Candidate: http://queue.taskcluster.net.proxxy1.srv.releng.scl3.mozilla.com/v1/task/aBgvYSk3S0OH0yIRGqI1jA/artifacts/public/build/fennec-46.0a1.en-US.android-arm.apk
09:50:37     INFO - trying http://queue.taskcluster.net.proxxy1.srv.releng.scl3.mozilla.com/v1/task/aBgvYSk3S0OH0yIRGqI1jA/artifacts/public/build/fennec-46.0a1.en-US.android-arm.apk
09:50:37     INFO - mkdir: /builds/panda-0573/test/build
09:50:37     INFO - Downloading http://queue.taskcluster.net.proxxy1.srv.releng.scl3.mozilla.com/v1/task/aBgvYSk3S0OH0yIRGqI1jA/artifacts/public/build/fennec-46.0a1.en-US.android-arm.apk to /builds/panda-0573/test/build/fennec-46.0a1.en-US.android-arm.apk
09:50:37     INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'http://queue.taskcluster.net.proxxy1.srv.releng.scl3.mozilla.com/v1/task/aBgvYSk3S0OH0yIRGqI1jA/artifacts/public/build/fennec-46.0a1.en-US.android-arm.apk', 'file_name': '/builds/panda-0573/test/build/fennec-46.0a1.en-US.android-arm.apk'}, attempt #1
09:50:41     INFO - Downloaded 42565462 bytes.
09:50:41     INFO - Running post-action listener: _resource_record_post_action
09:50:41     INFO - Running post-action listener: set_extra_try_arguments
09:50:41    FATAL - Uncaught exception: Traceback (most recent call last):
09:50:41    FATAL -   File "/builds/panda-0573/test/scripts/mozharness/base/script.py", line 1782, in run
09:50:41    FATAL -     self.run_action(action)
09:50:41    FATAL -   File "/builds/panda-0573/test/scripts/mozharness/base/script.py", line 1724, in run_action
09:50:41    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
09:50:41    FATAL -   File "/builds/panda-0573/test/scripts/mozharness/base/script.py", line 1664, in _possibly_run_method
09:50:41    FATAL -     return getattr(self, method_name)()
09:50:41    FATAL -   File "scripts/scripts/android_panda_talos.py", line 303, in download_and_extract
09:50:41    FATAL -     dirs['abs_fennec_dir'])
09:50:41    FATAL -   File "/builds/panda-0573/test/scripts/mozharness/base/script.py", line 474, in download_unpack
09:50:41    FATAL -     self.unpack(archive, parent_dir, extract_dirs, halt_on_failure)
09:50:41    FATAL -   File "/builds/panda-0573/test/scripts/mozharness/base/script.py", line 1434, in unpack
09:50:41    FATAL -     raise NotImplementedError('No extraction method found for: %s' % filename)
09:50:41    FATAL - NotImplementedError: No extraction method found for: /builds/panda-0573/test/build/fennec-46.0a1.en-US.android-arm.apk
09:50:41    FATAL - Running post_fatal callback...
09:50:41     INFO - request_url doesn't exist. Already closed?
09:50:41    FATAL - Exiting -1
Flags: needinfo?(hskupin)
The problem here is that when I combined every call to `unzip` into `unpack()` I missed that specifically only Talos tests are expanding apk files. So I totally missed that single Android API11 platform in combination with Talos. :/

We will leave the current tests running so that we can see if something else is broken too (which I don't think). I will then provide an updated patch, including a try server run for the missed platform and tests.
Flags: needinfo?(hskupin)
Ok, so the problem here is that I actually pushed a full try with both of the patches applied. That means that in the above cases we do not check for an extension name but directly test if the file is an archive of a specific type! As it can be seen on TH even more jobs were failing which fall into the B2G product category, whereby the files for extraction do even not contain an extension.

I would propose that we combine those two remaining patches into a single one for landing. That means I will fix all outstanding issues and request review from Jordan again.

Sorry for those troubles!
Attachment #8709100 - Flags: checked-in+
(In reply to Jordan Lund (:jlund) from comment #24)
> >              command = self.query_exe('tar', return_type='list')
> 
> another thought. if we mainly care about the presence of unzip/tar, we could
> instead use self.which() here and, iirc, it will also check for the
> existance of the program. so if not found, we could fall back to python's
> implementation.

That works actually pretty good, and is totally my preferred method now. Thanks for pointing that method out.

(In reply to Jordan Lund (:jlund) from comment #22)
> >              command = self.query_exe('tar', return_type='list')
> > +            tar_args = 'zxf' if filename.endswith('gz') else 'jxf'
> 
> hmm, did the group(1) not work?

Formerly we were using the regex in two lines. Now that is no longer the case, given that we properly check with `is_tarfile()`. Given that regexp are expensive, a simple string comparison should be enough or?

> > +            retval = self.run_command(command,
> > +                                      error_level=WARNING,
> > +                                      error_list=TarErrorList,
> > +                                      fatal_exit_code=3,
> 
> I think fatal_exit_code can disappear too if halt_on_failure is gone.

Given the usage of `self.which()` we have to leave the original calling convention in place. So I will revert it appropriately.

> > +                                      fatal_exit_code=3,
> > +                                      )
> > +
> > +            # In case of an OS error (tar not available) fallback to TarFile
> > +            if retval == -1:
> 
> looks like it is possible for retval to != -1. e.g. if run_command doesn't
> throw an exception but tar subprocess cmd returns a non 0.
> 
> iiuc, it may be better to check `if retval` (a none Falsey value).

This is no longer used, and we no longer have a silent fallback. That's definitely the correct approach, so that we can add jobs step by step using the internal classes.

> > +                        # ZipFile doesn't preserve permissions: http://bugs.python.org/issue15795
> > +                        filename = os.path.realpath(os.path.join(extract_to, name))
> > +                        mode = bundle.getinfo(name).external_attr >> 16 & 0x1FF
> 
> interesting.. have you tested this? e.g. forced this code to run?

This code actually comes from mozfile, and I have properly tested it. The first version of the patch didn't contain it, and was failing a lot with permission issues. But given that there were still some left in my last try run for this patch, I will have another closer look.
As I have noticed for our firefox ui tests we have a mismatch in the exit code of build bot and what treeherder assumes. See https://github.com/mozilla/mozmill-ci/issues/716

Jordan, by default `run_command()` uses an exit code of 2 in case of failures. Within testbase.py we set an exit code of 3. Is there any reason for that? Why can't we always exit with 2 which would mean failure? In case of 3 we have the mismatch between exception and skipped, so that jobs on Treeherder could be hidden.
Flags: needinfo?(jlund)
Jordan, I'm also struggling a bit regarding the calls to `self.fatal()` vs. `raise`. Is there a rule when using what? Should internal code raise while run_command() calls will exit with fatal if halt_on_error is set? Given that we have both cases in the same method now, it's a bit confusing.
Summary: mozharness should use Python's ZipFile in case unzip has not been set in the config file → mozharness should use Python's ZipFile/TarFile in case unzip/tar commands are not available
(In reply to Henrik Skupin (:whimboo) from comment #33)
> As I have noticed for our firefox ui tests we have a mismatch in the exit
> code of build bot and what treeherder assumes. See
> https://github.com/mozilla/mozmill-ci/issues/716
> 
> Jordan, by default `run_command()` uses an exit code of 2 in case of
> failures. Within testbase.py we set an exit code of 3. Is there any reason
> for that? Why can't we always exit with 2 which would mean failure? In case
> of 3 we have the mismatch between exception and skipped, so that jobs on
> Treeherder could be hidden.

there are a few things here we have to keep in mind:
* the return code and log level of mozharness. Which doesn't have to be conscious of Mozilla specific things
* the return code we bubble up to buildbot and then to treeherder.

run_command() is generic and its defaults should not be aligned with mozilla's expectation. We should be explicit in the call when we want to set the return code to something that buildbot/treeherder interperets.

So saying all that: historically we would set self.return_code or the exit value of the mozharness call to 3 so that buildbot would treat that as an EXCEPTION. Which would mean an error on our infra, not in the code it was building/testing. But iiuc, we are moving away from that in favor of turning the build orange (1) if it's releng/automation fault and failure (2) if it's the build/test fault. So not using 3. 

the default error level of 2 seems reasonable. do you think it should change?

(In reply to Henrik Skupin (:whimboo) from comment #34)
> Jordan, I'm also struggling a bit regarding the calls to `self.fatal()` vs.
> `raise`. Is there a rule when using what? Should internal code raise while
> run_command() calls will exit with fatal if halt_on_error is set? Given that
> we have both cases in the same method now, it's a bit confusing.

I think the answer is, it depends. If we want to fatal on an exception being raised, we should use halt_on_failure=True. When we want to catch exceptions, do some clean up, or retry(run_command), we should allow to raise exceptions and have halt_on_failure=False.

Does that make sense? Am I answering your question? Is there a way we can improve this?
Flags: needinfo?(jlund)
Blocks: 1241061
(In reply to Henrik Skupin (:whimboo) from comment #32)
> > > +                        # ZipFile doesn't preserve permissions: http://bugs.python.org/issue15795
> > > +                        filename = os.path.realpath(os.path.join(extract_to, name))
> > > +                        mode = bundle.getinfo(name).external_attr >> 16 & 0x1FF
> > 
> > interesting.. have you tested this? e.g. forced this code to run?
> 
> This code actually comes from mozfile, and I have properly tested it. The
> first version of the patch didn't contain it, and was failing a lot with
> permission issues. But given that there were still some left in my last try
> run for this patch, I will have another closer look.

Looks like this is broken when I extract a tests.zip file for Windows on a Linux machine. All files have no permissions and tests will fail. I have seen this because of bug 1239808. When I remove changing the permissions it all works fine. That means that more testing is necessary here across platforms.
Depends on: 1242592
Depends on: 1243318
Sorry for the delay here but I had to finish some other goals first. I will work again on this patch the next days.
So attachment 8707590 [details] [diff] [review] got landed on mozilla-central (see comment 12: https://hg.mozilla.org/integration/mozilla-inbound/rev/5a980da4afb5) while we were on 46.0, but actually was never backported to aurora 45.0. As result it is causing total bustage of our firefox-ui-tests on Windows due to the missing unzip executable. This affects current candidate builds on mozilla-release and mozilla-esr45. 

Given that attachment 8707590 [details] [diff] [review] is test-only and didn't cause any regressions I would request a backport to mozilla-release and mozilla-esr45.
Whiteboard: [checkin-needed-release][checkin-needed-esr45]
(In reply to Sebastian H. [:aryx][:archaeopteryx] from comment #29)
> Backed out in
> https://hg.mozilla.org/integration/mozilla-inbound/rev/c912bbffbbce for
> Android Talos bustage.

I haven't followed this bug closely, but if the problem here was specific to Android Talos, there's good news: We no longer run Android Talos! (Similar jobs now run in autophone, as Tier 3 -- maybe worth checking too.)
Is it possible that the landed changes here have introduced intermittent hangs during/after tooltool fetches?

Bug 1051908 collects Android mozharness-script-level timeouts. There were almost no such failures files in January or February, but there have been some intermittent timeouts in the last couple of weeks.

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1051908&startday=2016-01-01&endday=2016-03-16&tree=all

Logs suggest tooltool fetch or unzip is implicated:

https://treeherder.mozilla.org/logviewer.html#?repo=fx-team&job_id=7957720

 00:31:47     INFO - Running command: ['/tools/tooltool.py', '--url', 'https://api.pub.build.mozilla.org/tooltool/', '--authentication-file', '/builds/relengapi.tok', 'fetch', '-m', '/builds/slave/test/build/tooltool.tt', '-o', '-c', '/builds/tooltool_cache'] in /builds/slave/test/build
 00:31:47     INFO - Copy/paste: /tools/tooltool.py --url https://api.pub.build.mozilla.org/tooltool/ --authentication-file /builds/relengapi.tok fetch -m /builds/slave/test/build/tooltool.tt -o -c /builds/tooltool_cache
 00:31:47     INFO -  INFO - File android-sdk_r24.0.2-linux.tgz not present in local cache folder /builds/tooltool_cache
 00:31:47     INFO -  INFO - Attempting to fetch from 'https://api.pub.build.mozilla.org/tooltool/'...
 
 command timed out: 2400 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/android_emulator_unittest.py', '--cfg', 'android/androidarm_4_3.py', '--test-suite', 'mochitest-gl-3', '--blob-upload-branch', 'fx-team', '--download-symbols', 'ondemand'], attempting to kill
 process killed by signal 9
 program finished with exit code -1
 elapsedTime=2427.650877


I wonder if intermittent network or tooltool server-side failures were handled differently before, perhaps with a faster timeout, more specific error message, and/or retries?

One detail that jumps out at me is that the old _download_unzip used an output_timeout: http://hg.mozilla.org/mozilla-central/rev/5a980da4afb5#l3.77, but I don't see any timeout in the new code.
Whiteboard: [checkin-needed-release][checkin-needed-esr45]
:whimboo is this fixed?
Flags: needinfo?(hskupin)
(In reply to Geoff Brown [:gbrown] from comment #40)
> I haven't followed this bug closely, but if the problem here was specific to
> Android Talos, there's good news: We no longer run Android Talos! (Similar
> jobs now run in autophone, as Tier 3 -- maybe worth checking too.)

If there is a problem with Android Talos tests it will not be a problem of this bug. The backout of the patch which happened via comment 29 has never landed again yet. So it's not related. I would suggest to file a new bug for it with more details including since when it is missing.

(In reply to Justin Wood (:Callek) [back on Mar 21] from comment #42)
> :whimboo is this fixed?

Not really. It's actually in a non-completed state. What has been landed so far is attachment 8707590 [details] [diff] [review] and that is also in Firefox 45. The remaining two patches still have to be updated and landed. Sadly I haven't had the time to work on this bug lately given release promotion and other breaking changes for us. Would you propose that we move out the remaining work to another bug?
Flags: needinfo?(hskupin)
(In reply to Henrik Skupin (:whimboo) from comment #43)
> Would you propose that we move out the remaining work to another bug?

Probably worth it
(In reply to Henrik Skupin (:whimboo) from comment #43)
> (In reply to Geoff Brown [:gbrown] from comment #40)
> > I haven't followed this bug closely, but if the problem here was specific to
> > Android Talos, there's good news: We no longer run Android Talos! (Similar
> > jobs now run in autophone, as Tier 3 -- maybe worth checking too.)
> 
> If there is a problem with Android Talos tests it will not be a problem of
> this bug. The backout of the patch which happened via comment 29 has never
> landed again yet. So it's not related. I would suggest to file a new bug for
> it with more details including since when it is missing.

Sorry if I wasn't clear. I was just trying to be helpful by pointing out that the tests that caused the backout in comment 29 have been changed/moved: "Android 4.0 opt" Talos tests do not run any more (intentionally), so that patch may not cause failures if landed today. There's nothing missing or currently failing.
Except...

(In reply to Geoff Brown [:gbrown] from comment #41)
> Is it possible that the landed changes here have introduced intermittent
> hangs during/after tooltool fetches?

which was referencing https://hg.mozilla.org/mozilla-central/rev/5a980da4afb5, which I believe landed and was not backed out.
See Also: → 1256300
The landed patch was only a refactoring of current code and I didn't touch tooltool. Looking at the pointed out bug I do not see an increase of reported timeouts around the date the mentioned patch got landed. There were actually 2 months(!) without any reported failure. I really do not see a relationship here.

(In reply to Justin Wood (:Callek) [back on Mar 21] from comment #44)
> (In reply to Henrik Skupin (:whimboo) from comment #43)
> > Would you propose that we move out the remaining work to another bug?
> 
> Probably worth it

Ok, will shortly do it.
Blocks: 1258539
Updating bug so it only covers the refactoring of download_unzip(). New bug for the remaining work has been filed as bug 1258539.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Keywords: leave-open
Resolution: --- → FIXED
Summary: mozharness should use Python's ZipFile/TarFile in case unzip/tar commands are not available → Refactor download_unzip() for easier usage in mozharness
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: