Make Mozharness downloads and unpacks actions handle better intermittent S3/EC2 issues

VERIFIED FIXED

Status

defect
VERIFIED FIXED
3 years ago
3 years ago

People

(Reporter: armenzg, Assigned: armenzg)

Tracking

unspecified

Firefox Tracking Flags

(firefox51 fixed)

Details

Attachments

(2 attachments)

We will use this bug to solve the problem and block the intermittent oranges which have been filed.
Blocks: 1301594
Blocks: 1301597
Blocks: 1301605
Blocks: 1300413
Blocks: 1300343
Blocks: 1300663
Blocks: 1300943
Blocks: 1301645
Hi philor, I will tackle each dependent bug in the order of worst offenders.
I might tackle more than one at a time.
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Attachment #8789845 - Flags: review?(gps)
Attachment #8789845 - Flags: review?(aki)
Comment hidden (mozreview-request)
Comment on attachment 8789845 [details]
Bug 1300812 - Make Mozharness downloads handle better intermittent S3/EC2 issues

This works, however, I'm more interested on feedback as I'm not completely satisfied on what I wrote.
Attachment #8789845 - Flags: review?(gps)
Attachment #8789845 - Flags: review?(aki)
Attachment #8789845 - Flags: feedback?(gps)
Attachment #8789845 - Flags: feedback?(aki)
Assignee

Comment 6

3 years ago
mozreview-review
Comment on attachment 8789845 [details]
Bug 1300812 - Make Mozharness downloads handle better intermittent S3/EC2 issues

https://reviewboard.mozilla.org/r/77904/#review76320

::: testing/mozharness/mozharness/base/script.py:500
(Diff revision 3)
> +            raise FetchedIncorrectFilesize(
> +                'The server file size is {} while we got instead {}'.format(
> +                    expected_file_size, obtained_file_size)
> +            )
> +
> +        compressed_file = StringIO(file_contents)

gps: should we simply passed compressed_file instead of file_object?

Perhaps it is better to completely separate the download logic from the unpacking.

We might be able to have two different retry calls with their own set of exceptions being handled.

::: testing/mozharness/mozharness/base/script.py:521
(Diff revision 3)
>                      if mode:
>                          os.chmod(fname, mode)
>  
>          except zipfile.BadZipfile as e:
>              self.exception('{}'.format(e.message))
> +            raise

aki: do I need to raise the exception if I'm not catching it inside of the retry?
Comment on attachment 8789845 [details]
Bug 1300812 - Make Mozharness downloads handle better intermittent S3/EC2 issues

> ::: testing/mozharness/mozharness/base/script.py:521
> (Diff revision 3)
> >                      if mode:
> >                          os.chmod(fname, mode)
> >  
> >          except zipfile.BadZipfile as e:
> >              self.exception('{}'.format(e.message))
> > +            raise
> 
> aki: do I need to raise the exception if I'm not catching it inside of the
> retry?

What do you want to have happen?
self.exception() will log the exception at level ERROR by default:
http://hg.mozilla.org/mozilla-central/file/tip/testing/mozharness/mozharness/base/log.py#l157

Raising it without catching it will kill the script.  A similar thing can happen by self.exception() with level=FATAL.  It looks like run() catches exceptions and runs the post-run cleanup/reporting steps anyway:
http://hg.mozilla.org/mozilla-central/file/tip/testing/mozharness/mozharness/base/script.py#l1930

So if you want the script to die at this point, raising it seems ok.  I think it's equivalent to self.exception(..., level=FATAL).
Attachment #8789845 - Flags: feedback?(aki) → feedback+
Posted file Test
Any suggestions on how to patch download_unpack()?
I want to test the case scenario where the content_length is bad once and the next retry will get the right content_length.

For now, I'm only dealing with file:// as I don't know how to mock http urls.
Nevermind my last comment. I'm making progress on writing the test.
Blocks: 1301802
Blocks: 1301807
Blocks: 1300953
Blocks: 1301855
Blocks: 1300625
No longer blocks: 1301605
Summary: Make Mozharness downloads handle better intermittent S3/EC2 issues → Make Mozharness downloads and unpacks actions handle better intermittent S3/EC2 issues
Comment hidden (mozreview-request)

Comment 11

3 years ago
mozreview-review
Comment on attachment 8789845 [details]
Bug 1300812 - Make Mozharness downloads handle better intermittent S3/EC2 issues

https://reviewboard.mozilla.org/r/77906/#review76684

Overall, I think this looks good.  Some history and comments:

The original idea behind the download and extract methods was to speed up local testing by saving the downloaded file to disk, so we wouldn't have to re-download every time we ran.  That would allow us to run other actions without necessarily needing the network... skip the download action, and the extract action would still work properly.

However, mozharness has definitely become more of a CI tool than a CI+local development tool, and we already changed the download + extract actions to download_and_extract or download_unpack here, so retrying the download to memory and extracting seems like the best solution if we continue in that direction.

One thing: we're assuming the file will always be a size that will fit into memory.  That may be a valid assumption forever, or we may hit files that we can't extract because of size.  If that ever becomes a concern, we could put in a check when we find the expected filesize; if it's larger than our max in-memory size, we can write to disk and treat it as a file.

::: testing/mozharness/mozharness/base/script.py:679
(Diff revision 4)
> -            function,
> -            kwargs=kwargs,
> +            self.fetch_url_into_memory,
> +            kwargs={'url': url},
>              **retry_args
>          )
>  
> +        if return_value == -1:

I think we'll never hit this because the above error_level is FATAL.  It doesn't hurt, though.

::: testing/mozharness/mozharness/base/script.py:691
(Diff revision 4)
> +            retry_args = dict(
> +                retry_exceptions=(
> +                    zipfile.BadZipfile
> +                ),
> +                error_message="Can't unpack the fetched file.",
> +                error_level=FATAL,

This hardcoded FATAL means we can't proceed without this file downloaded.  That's usually what we want, and we already have a hardcoded FATAL above.
Attachment #8789845 - Flags: review?(aki) → review+
Comment hidden (mozreview-request)
I addressed your comments.
I've also fixed an issue in the code which would the tests fail.
I'm re-testing on try.

Comment 14

3 years ago
mozreview-review
Comment on attachment 8789845 [details]
Bug 1300812 - Make Mozharness downloads handle better intermittent S3/EC2 issues

https://reviewboard.mozilla.org/r/77906/#review76734

This is fine as a refactor as it is better than the previous code because it won't retry download if that steps succeeds. However, I'm going to challenge that archive extraction needs to be retried. Archive extraction should be an deterministic operation. Given a consant input, it will either work or it won't. I'm not sure what retrying that specific step will accomplish other than add latency to the job eventually failing.

Since we retried archive extraction before, I'm fine giving this an r+. But please consider eliminating the retry (or possibly capping it to 1 retry) as a follow-up.

::: testing/mozharness/mozharness/base/script.py:398
(Diff revision 5)
> +
> +        self.info('Expected file size: {}'.format(expected_file_size))
> +        self.debug('Url: {}'.format(url))
> +        self.debug('Content-Encoding {}'.format(response.headers.get('Content-Encoding')))
> +
> +        file_contents = response.read()

I'm pretty sure this will result in 2x memory usage because we first buffer the entire HTTP response *then* make a copy into a BytesIO later. I don't think this will be a problem for our current use cases, however.

::: testing/mozharness/mozharness/base/script.py:409
(Diff revision 5)
> +                    expected_file_size, obtained_file_size)
> +            )
> +
> +        # Use BytesIO instead of StringIO
> +        # http://stackoverflow.com/questions/34162017/unzip-buffer-with-python/34162395#34162395
> +        return BytesIO(file_contents)

While it is correct to use ``io.BytesIO`` (``io.BytesIO`` represents binary data, ``io.StringIO`` represents Unicode), there is a caveat.

The ``StringIO.StringIO`` class (which only exists in Python 2) is kinda a hybrid class. And, it is backed by a more efficient implementation. In Python 2, the ``io.*IO`` classes are a bit slower than ``StringIO.StringIO``. For that reason, many people still cling to ``StringIO.StringIO`` on Python 2.

However, I /think/ we're OK here since the ``BytesIO`` instance is pre-allocated with the full size of the final data. If we were streaming into the ``BytesIO``, then we'd have to worry about perf.
Attachment #8789845 - Flags: review?(gps) → review+
Comment hidden (mozreview-request)
Good point about that second retry.
I dropped the need for it and make a direct call.

If we ever saw zipfile.BadZipfile it was because it was a bad download to begin with.

I will do one last try run and land tomorrow morning.

Thank you for the reviews.
Blocks: 1302237
Comment hidden (mozreview-request)

Comment 18

3 years ago
Pushed by armenzg@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4c2d01fe1c38
Make Mozharness downloads handle better intermittent S3/EC2 issues r=aki,gps
Blocks: 1302439
See Also: → 1272083

Comment 19

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/4c2d01fe1c38
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Blocks: 1303668
The remaining bugs that this bug 'blocks' will be dealt with on bug 1303759.

Both bugs shared those bugs as the same 'orange' signature.
Status: RESOLVED → VERIFIED
Blocks: 1305752
You need to log in before you can comment on or make changes to this bug.