Closed
Bug 1358142
Opened 8 years ago
Closed 8 years ago
Support Content-Encoding header when downloading from urls
Categories
(Release Engineering :: Applications: MozharnessCore, defect)
Tracking
(firefox55 fixed)
RESOLVED
FIXED
People
(Reporter: pmoore, Assigned: pmoore)
References
Details
Attachments
(1 file, 2 obsolete files)
12.15 KB, patch, review+ (mozilla)
When adding support to the taskcluster generic-worker for publishing gzip-encoded artifacts in bug 1347956, I discovered that mozharness still uses urllib2, which in the deployed version, amazingly, still doesn't natively respect the Content-Encoding header defined in HTTP/1.1 (RFC 2616).
I have written a patch to fix this in mozharness, which I'm currently running in this try push:
https://treeherder.allizom.org/#/jobs?repo=try&revision=3b4f51c00db45603c243451a44d5715b7d00f020
There are two different download mechanisms in testing/mozharness/mozharness/base/script.py. I fixed this in both:
* ScriptMixin.fetch_url_into_memory
* ScriptMixin._download_file
Note: I think currently only the _download_file method is used for fetching artifacts (I'm not 100% sure), but it is no doubt useful to handle this in both places. Content-Encoding is a standard HTTP header, and fetch_url_into_memory can only benefit from supporting it if, at any point in future, the urls it hits start using this header.
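The shape of the fix can be sketched as follows. This is an illustrative standalone version written in modern Python, not the actual mozharness patch, and `maybe_gunzip` is a hypothetical helper name; passing `wbits = 16 + zlib.MAX_WBITS` tells zlib to expect a gzip wrapper rather than a raw deflate stream:

```python
import gzip
import zlib

def maybe_gunzip(body, headers):
    # Hypothetical helper (not the actual mozharness code): decompress a
    # response body only when the server declared Content-Encoding: gzip.
    # wbits = 16 + MAX_WBITS makes zlib decode the gzip header/trailer.
    if headers.get('Content-Encoding') == 'gzip':
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    return body

# Simulated response: a gzip-compressed JSON artifact body.
payload = b'{"artifact": true}'
compressed = gzip.compress(payload)
assert maybe_gunzip(compressed, {'Content-Encoding': 'gzip'}) == payload
assert maybe_gunzip(payload, {}) == payload
```

Bodies without the header pass through untouched, so the helper is safe to apply to every response.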
Comment 1 • 8 years ago
I believe there is a copy of requests in the repo - any chance of switching to that instead?
Comment 2 (Assignee) • 8 years ago
I wanted to keep the delta small, especially as this isn't an area of expertise for me. This is also blocking the roll-out of other generic-worker fixes that are currently blocking testers on win10, so I had to choose between a simple patch or backing out entirely. We can create a separate bug to move over to the requests library if you like. The change I made here seems to work locally, so I'd prefer not to rewrite it all at this time, but leave that for the future when other priorities are not in the way.
Comment 3 (Assignee) • 8 years ago
(When I say back out entirely, I mean: I made a generic-worker release with critical fixes in it, *plus* the stuff from bug 1347956, and bug 1347956 depends on this bug. So to unblock win10 testing I'd have to pull the bug 1347956 changes out of generic-worker, make a new release, and roll that out. I chose the route of a simple fix in mozharness instead, to avoid the need for new generic-worker releases etc.)
Comment 4 (Assignee) • 8 years ago
Comment 5 (Assignee) • 8 years ago
Aki, I renamed FetchedIncorrectFilesize to ContentLengthMismatch, as I thought this made it clearer that the amount of data retrieved from the url was inconsistent with the Content-Length header received, rather than with user-supplied data ("I expect the downloaded file to be of this size").
After I made the change, I noticed an edge case: something might use fetch_url_into_memory to read a local file, in which case the content length of a file url technically doesn't come from a Content-Length header. However, I think this is rare in practice.
The only place I see this might be a problem is:
> 'The retrieved Content-Length header declares a body length of {} bytes, while we actually retrieved {} bytes'
I guess I could generate the error message based on the url scheme, but it should never happen that os.stat() reports one file size and reading the file yields a different number of bytes (and if it does, we have a much more serious problem with the worker). So I don't think this edge case needs particular attention; we probably don't need to validate file size for file urls at all, but I left the code as it is to avoid increasing the delta of this change.
Comment 6 (Assignee) • 8 years ago
Also note: I couldn't find a trivial way to gunzip the content without either reading it all into memory or downloading it to a temp file, so I went for the temp file option. If you know a simple way to gunzip on the fly without reading everything into memory, I'm all ears. From what I could see, in order to decompress, the compression library needs to seek through the data, and of course you can't easily seek back and forth on an http response body; hence writing to a compressed file first and then decompressing that. If you know of something, and have a working example, let me know.
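The temp-file route described above can be sketched roughly like this. It is a minimal illustration in modern Python, not the actual patch; `download_gunzip` and `read_chunk` are hypothetical names, with `read_chunk` standing in for whatever reads successive chunks from the http response:

```python
import gzip
import os
import shutil
import tempfile

def download_gunzip(read_chunk, dest_path):
    # Hypothetical sketch: stream the compressed body to a temporary
    # file first, then let gzip seek freely through that file while
    # decompressing to dest_path. read_chunk() returns b'' at EOF.
    fd, tmp_path = tempfile.mkstemp(suffix='.gz')
    try:
        with os.fdopen(fd, 'wb') as tmp:
            for chunk in iter(read_chunk, b''):
                tmp.write(chunk)
        with gzip.open(tmp_path, 'rb') as f_in, open(dest_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    finally:
        os.remove(tmp_path)
```

The temp file is removed even on failure, so an aborted download doesn't leave compressed debris in the work directory.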
Comment 7 (Assignee) • 8 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> Also note ... I couldn't find a trivial way to gunzip the content without
> either reading it all into memory, or downloading it as a temp file, so I
> went for the temp file option. If you know a simple way to gunzip on the fly
> without reading everything into memory, I'm all ears. From what I could see,
> in order to decompress, the compression library needs to seek through the
> data, and of course you can't easily seek back and forth on an http response
> body, hence the reason to write to a compressed file first, and then
> decompress that. If you know of something though, and have a working
> example, let me know.
To demonstrate the problem I had with decompressing on the fly:
pmoore@Petes-iMac:~/hg/mozilla-central/testing/mozharness $ python
Python 2.7.13 (default, Dec 17 2016, 23:03:43)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> handle = urllib2.urlopen('https://queue.taskcluster.net/v1/task/Rfdymk6_RyeSCVmBGdnsMA/artifacts/public/build/firefox-55.0a1.en-US.win32.test_packages.json', timeout=30)
>>> import gzip
>>> gzipFile = gzip.GzipFile(fileobj=handle)
>>> x = gzipFile.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 261, in read
self._read(readsize)
File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 295, in _read
pos = self.fileobj.tell() # Save current position
AttributeError: addinfourl instance has no attribute 'tell'
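For what it's worth, zlib's decompressobj can decode a gzip stream incrementally without any seeking, which sidesteps the tell() problem shown in the traceback above. This is an alternative sketch in modern Python (hypothetical function name), not what landed in the patch:

```python
import gzip
import zlib

def gunzip_stream(chunks):
    # Incremental gzip decompression: decompressobj keeps its own state
    # between calls, so the input never needs to support seek()/tell()
    # the way gzip.GzipFile's fileobj does.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = []
    for chunk in chunks:
        out.append(decoder.decompress(chunk))
    out.append(decoder.flush())
    return b''.join(out)

# Feed the compressed body in small pieces, as reads from an http
# response would deliver it.
body = gzip.compress(b'{"kind": "test_packages"}')
pieces = [body[i:i + 5] for i in range(0, len(body), 5)]
assert gunzip_stream(pieces) == b'{"kind": "test_packages"}'
```

Memory use stays bounded by the chunk size plus zlib's window, so neither the full compressed nor decompressed body needs to be held at once if the output is written out per chunk.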
Comment 8 • 8 years ago
(In reply to Ed Morley [:emorley] from comment #1)
> I believe there is a copy of requests in the repo - any chance of switching
> to that instead?
We run a lot of automation by grabbing a zip of mozharness and running without any venv setup, so this change might break a lot of things.
Comment 9 • 8 years ago
Comment on attachment 8860050 [details] [diff] [review]
bug1358142_gecko_v1.patch
I think this looks good overall. This is pretty core, so don't be surprised if something unexpected breaks. I'm going to respond to some of your other comments in-bug.
Attachment #8860050 -
Flags: review?(aki) → review+
Comment 10 • 8 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #5)
> Aki, I renamed FetchedIncorrectFilesize to ContentLengthMismatch as I
> thought this was clearer that the amount of data retrieved from the url was
> not consistent with the Content-Length header received, rather than this
> being user data - "I expect the downloaded file to be of this size".
>
> After I made the change I noticed that there is an edge case where something
> might use the fetch_url_into_memory to read a file - in which case the
> content length of a file url technically doesn't come from a Content-Length
> header. However, I think this is an edge case.
>
> The only place I see this might be a problem is:
>
> > 'The retrieved Content-Length header declares a body length of {} bytes, while we actually retrieved {} bytes'
>
> I guess I could generate an error based on the url scheme, but it should
> never be possible that if os.stat() says the file size is something, and
> after reading the file, we have a different number of bytes (if so, we have
> a much more serious problem with the worker), so I don't think this is an
> edge case we need to be particularly concerned with - we probably don't need
> to validate file size for file urls at all, but I left the code as it is, to
> avoid increasing the delta of this change.
This makes sense. Otherwise we could declare error_msg = "The retrieved Content-Length ..." and override in the if block, and error_msg.format(...) on error.
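The pattern suggested above (a default message, overridden per scheme, formatted only on error) could look roughly like this; the function and message for the file case are hypothetical, just illustrating the shape:

```python
def validate_length(expected, actual, scheme='http'):
    # Illustrative only: the default message assumes a Content-Length
    # header; file urls get a message reflecting os.stat() instead.
    error_msg = ('The retrieved Content-Length header declares a body '
                 'length of {} bytes, while we actually retrieved {} bytes')
    if scheme == 'file':
        error_msg = 'os.stat() reported {} bytes, but we read {} bytes'
    if expected != actual:
        raise ValueError(error_msg.format(expected, actual))
```

Formatting only on the error path keeps the happy path free of string work and the two messages in one place.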
Comment 11 • 8 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> Also note ... I couldn't find a trivial way to gunzip the content without
> either reading it all into memory, or downloading it as a temp file, so I
> went for the temp file option. If you know a simple way to gunzip on the fly
> without reading everything into memory, I'm all ears. From what I could see,
> in order to decompress, the compression library needs to seek through the
> data, and of course you can't easily seek back and forth on an http response
> body, hence the reason to write to a compressed file first, and then
> decompress that. If you know of something though, and have a working
> example, let me know.
Mozharness has a long history of temp files, especially since we usually want actions to be standalone and not always need to hit the network. So a `download` action might save a temp file in work_dir, and a `do_something_with_download` action would then reference the temp file. The latter action could be run standalone if someone put the file in the expected place before running. Basically, I'm not worried about using a temp file here.
Comment 12 (Assignee) • 8 years ago
Thanks Aki!
Comment 13 • 8 years ago
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/35ef9cd54519
Support Content-Encoding header in mozharness when downloading from a url,r=aki
Comment 14 • 8 years ago
Comment 15 • 8 years ago
bugherder
Comment 16 (Assignee) • 8 years ago
(In reply to Greg Arndt [:garndt] from comment #14)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=ec7ce9a1dd1a787c4d82d820809daf08cc6a3bd3
You've submitted a task against a worker type "gecko-t-win10-64-gpu-beta" which does not exist, so the queue has rejected the submission, and your decision task has failed. We only have a beta worker type for gecko-t-win10-64, not for gecko-t-win10-64-gpu. This is unrelated to this bug.
Comment 17 • 8 years ago
backed out for causing bug 1358493
Status: RESOLVED → REOPENED
Flags: needinfo?(pmoore)
Resolution: FIXED → ---
Comment 18 (Assignee) • 8 years ago
I'm so sorry.
Locally I have python 2.7.13, and this works. Do you know what version the nightlies use?
On 2.7.13, the syntax is accepted:
$ python
Python 2.7.13 (default, Dec 17 2016, 23:03:43)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> import shutil
>>> file_name='abc'
>>> with gzip.open(file_name + '.gz', 'rb') as f_in, open(file_name, 'wb') as f_out:
... shutil.copyfileobj(f_in, f_out)
...
>>> quit()
Flags: needinfo?(pmoore)
Comment 19 • 8 years ago
Backout by cbook@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/c8198aa6e767
Backed out changeset 35ef9cd54519 for breaking android nightlys
Comment 20 • 8 years ago
> (In reply to Greg Arndt [:garndt] from comment #14)
> > https://treeherder.mozilla.org/#/
> > jobs?repo=try&revision=ec7ce9a1dd1a787c4d82d820809daf08cc6a3bd3
>
> You've submitted a task against a worker type "gecko-t-win10-64-gpu-beta"
> which does not exist, so the queue has rejected the submission, and your
> decision task has failed. We only have a beta worker type for
> gecko-t-win10-64, not for gecko-t-win10-64-gpu. This is unrelated to this
> bug.
Sorry, I didn't intentionally comment on this bug, and I have already fixed the decision task stuff elsewhere (you are right about the wrong worker type). I think that because one of the commits in my push contained this bug number, the push updated this bug with the try link.
Comment 21 (Assignee) • 8 years ago
(In reply to Greg Arndt [:garndt] from comment #20)
> Sorry, I didn't intentionally comment on this bug, and I have already fixed
> the decision task stuff elsewhere (you are right about the wrong worker
> type). I think because one of the commits in my push contained a bug number
> that it updated this bug with the try link.
Ahhhhhh, sorry! I thought you had posted a link, and I thought it was strange you didn't add a comment. :) Now it all makes sense!
Comment 22 (Assignee) • 8 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #18)
> I'm so sorry.
>
> Locally I have python 2.7.13, and this is working, do you know what version
> is used by the nightlies?
>
>
> On 2.7.13, syntax is accepted:
>
>
> $ python
> Python 2.7.13 (default, Dec 17 2016, 23:03:43)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import gzip
> >>> import shutil
> >>> file_name='abc'
> >>> with gzip.open(file_name + '.gz', 'rb') as f_in, open(file_name, 'wb') as f_out:
> ... shutil.copyfileobj(f_in, f_out)
> ...
> >>> quit()
So it looks like it must be python 2.6 (or earlier), since Python 2.7 was the first release to support with statements containing multiple context managers. So I'm updating (and testing) the patch to be 2.6 compatible.
Comment 23 (Assignee) • 8 years ago
This is the same patch as before, but with the following two changes:
1) It is now python 2.6 compatible (gzip.open(...) in python 2.6 cannot be used within a with statement)
2) I've set an extra window bit on the in-memory zlib decompression, to explicitly ensure that gzip decoding is enabled in the zlib.decompress call.
I've tested both statement blocks in isolation under python 2.6 and python 2.7.
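The 2.6-compatible shape of the file decompression can be sketched as below (hypothetical function name, not the patch itself); since GzipFile only gained context-manager support in Python 2.7, nested try/finally replaces the multi-clause with statement:

```python
import gzip
import shutil

def gunzip_file(file_name):
    # Python 2.6-compatible sketch: open the compressed file and the
    # destination explicitly, and guarantee both are closed via
    # try/finally rather than a multi-clause with statement.
    f_in = gzip.open(file_name + '.gz', 'rb')
    try:
        f_out = open(file_name, 'wb')
        try:
            shutil.copyfileobj(f_in, f_out)
        finally:
            f_out.close()
    finally:
        f_in.close()
```

The same code runs unchanged on 2.7 and later, so nothing version-specific needs to be branched on.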
The following try push is also in progress (just linux64 opt builds+tests as a random selection):
* https://treeherder.mozilla.org/#/jobs?repo=try&revision=127a79dfd40fca2dd5f935ddf36c34989a280ecd
Thanks Aki!
Attachment #8860050 -
Attachment is obsolete: true
Attachment #8860476 -
Flags: review?(aki)
Comment 24 (Assignee) • 8 years ago
Ah, I just spotted a doubled closing bracket mistake from copy/pasting; will fix and reattach. Sorry!
Updated • 8 years ago
Attachment #8860476 -
Flags: review?(aki) → review+
Comment 25 (Assignee) • 8 years ago
New try push here:
* https://treeherder.mozilla.org/#/jobs?repo=try&revision=243264e5fd7c608dda2890a60fe3676188c1225e
Attachment #8860476 -
Attachment is obsolete: true
Attachment #8860478 -
Flags: review?(aki)
Comment 26 • 8 years ago
Attachment #8860478 -
Flags: review?(aki) → review+
Comment 27 • 8 years ago
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b255dbe5bc86
Support Content-Encoding header in mozharness when downloading from a url,r=aki
Comment 28 • 8 years ago
bugherder
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED