Closed Bug 1358142 Opened 8 years ago Closed 8 years ago

Support Content-Encoding header when downloading from urls

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox55 fixed)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(1 file, 2 obsolete files)

When adding support in the taskcluster generic-worker for publishing gzip-encoded artifacts in bug 1347956, I discovered that mozharness is still using urllib2, which in the deployed version, amazingly, still doesn't natively respect the Content-Encoding header defined in the HTTP/1.1 protocol (RFC 2616). I have written a patch to fix this in mozharness, which I'm currently running in this try push:

https://treeherder.allizom.org/#/jobs?repo=try&revision=3b4f51c00db45603c243451a44d5715b7d00f020

There are two different download mechanisms in testing/mozharness/mozharness/base/script.py, and I fixed this in both:

* ScriptMixin.fetch_url_into_memory
* ScriptMixin._download_file

Note: I think currently only the _download_file method is used for fetching artifacts (I'm not 100% sure), but it is no doubt useful to handle this in both places, since Content-Encoding is a standard HTTP header, and fetch_url_into_memory can only benefit from supporting it if at any point in the future the urls it hits start using this header.
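For illustration, handling Content-Encoding: gzip in a urllib2-based download might look something like the sketch below (a hypothetical helper, not the actual patch; it writes the compressed body to a temp file before decompressing, for reasons discussed later in this bug):

import gzip
import os
import shutil
import tempfile
import urllib2

def download_file(url, file_name, timeout=30):
    # In python 2, response.info() returns a mimetools.Message, so
    # getheader() retrieves the Content-Encoding header if present.
    response = urllib2.urlopen(url, timeout=timeout)
    if response.info().getheader('Content-Encoding') == 'gzip':
        # Stream the compressed body to a temp file first: gzip
        # decompression needs a seekable file object, and an http
        # response body isn't seekable.
        fd, tmp_path = tempfile.mkstemp(suffix='.gz')
        try:
            tmp = os.fdopen(fd, 'wb')
            try:
                shutil.copyfileobj(response, tmp)
            finally:
                tmp.close()
            f_in = gzip.open(tmp_path, 'rb')
            try:
                f_out = open(file_name, 'wb')
                try:
                    shutil.copyfileobj(f_in, f_out)
                finally:
                    f_out.close()
            finally:
                f_in.close()
        finally:
            os.remove(tmp_path)
    else:
        f_out = open(file_name, 'wb')
        try:
            shutil.copyfileobj(response, f_out)
        finally:
            f_out.close()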
I believe there is a copy of requests in the repo - any chance of switching to that instead?
I wanted to keep the delta small, especially as this isn't an area of expertise for me. This is also blocking the rollout of other generic-worker fixes that are blocking testers on win10 at the moment, so I had to choose between a simple patch or backing out entirely - but we can create a separate bug to move over to the requests library if you like. The change I made here seems to work locally, so I'd prefer not to rewrite it all at this time, but leave that as something for the future when other priorities are not in the way.
(When I say back out entirely: I made a generic-worker release with critical fixes in it, *plus* the stuff from bug 1347956, and bug 1347956 depends on this bug. So I'd have to pull the bug 1347956 changes out of generic-worker, make a new release of it, and roll that out, in order to unblock win10 testing. Instead, I chose the route of a simple fix in mozharness to avoid the need for new generic-worker releases etc.)
Attached patch bug1358142_gecko_v1.patch (obsolete) — Splinter Review
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Attachment #8860050 - Flags: review?(aki)
Aki, I renamed FetchedIncorrectFilesize to ContentLengthMismatch, as I thought this made it clearer that the amount of data retrieved from the url was not consistent with the Content-Length header received, rather than this being user data ("I expect the downloaded file to be of this size").

After I made the change I noticed that there is an edge case where something might use fetch_url_into_memory to read a file - in which case the content length of a file url technically doesn't come from a Content-Length header. However, I think this is an edge case. The only place I see this might be a problem is:

> 'The retrieved Content-Length header declares a body length of {} bytes, while we actually retrieved {} bytes'

I guess I could generate an error message based on the url scheme, but it should never be possible that os.stat() reports one file size and, after reading the file, we have a different number of bytes (if so, we have a much more serious problem with the worker). So I don't think this is an edge case we need to be particularly concerned with - we probably don't need to validate file size for file urls at all - but I left the code as it is, to avoid increasing the delta of this change.
Also note: I couldn't find a trivial way to gunzip the content without either reading it all into memory or downloading it as a temp file, so I went for the temp file option. If you know a simple way to gunzip on the fly without reading everything into memory, I'm all ears. From what I could see, in order to decompress, the compression library needs to seek through the data, and of course you can't easily seek back and forth on an http response body - hence the reason to write to a compressed file first, and then decompress that. If you know of something, though, and have a working example, let me know.
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> Also note ... I couldn't find a trivial way to gunzip the content without
> either reading it all into memory, or downloading it as a temp file, so I
> went for the temp file option. If you know a simple way to gunzip on the fly
> without reading everything into memory, I'm all ears. From what I could see,
> in order to decompress, the compression library needs to seek through the
> data, and of course you can't easily seek back and forth on an http response
> body, hence the reason to write to a compressed file first, and then
> decompress that. If you know of something though, and have a working
> example, let me know.

To demonstrate the problem I had with decompressing on-the-fly:

pmoore@Petes-iMac:~/hg/mozilla-central/testing/mozharness $ python
Python 2.7.13 (default, Dec 17 2016, 23:03:43)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> handle = urllib2.urlopen('https://queue.taskcluster.net/v1/task/Rfdymk6_RyeSCVmBGdnsMA/artifacts/public/build/firefox-55.0a1.en-US.win32.test_packages.json', timeout=30)
>>> import gzip
>>> gzipFile = gzip.GzipFile(fileobj=handle)
>>> x = gzipFile.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 261, in read
    self._read(readsize)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 295, in _read
    pos = self.fileobj.tell()   # Save current position
AttributeError: addinfourl instance has no attribute 'tell'
(In reply to Ed Morley [:emorley] from comment #1)
> I believe there is a copy of requests in the repo - any chance of switching
> to that instead?

We run a lot of automation by grabbing a zip of mozharness and running without any venv setup, so this change might break a lot of things.
Comment on attachment 8860050 [details] [diff] [review]
bug1358142_gecko_v1.patch

I think this looks good overall. This is pretty core, so don't be surprised if something unexpected breaks. I'm going to respond to some of your other comments in-bug.
Attachment #8860050 - Flags: review?(aki) → review+
(In reply to Pete Moore [:pmoore][:pete] from comment #5)
> Aki, I renamed FetchedIncorrectFilesize to ContentLengthMismatch as I
> thought this was clearer that the amount of data retrieved from the url was
> not consistent with the Content-Length header received, rather than this
> being user data - "I expect the downloaded file to be of this size".
>
> After I made the change I noticed that there is an edge case where something
> might use the fetch_url_into_memory to read a file - in which case the
> content length of a file url technically doesn't come from a Content-Length
> header. However, I think this is an edge case.
>
> The only place I see this might be a problem is:
>
> > 'The retrieved Content-Length header declares a body length of {} bytes, while we actually retrieved {} bytes'
>
> I guess I could generate an error based on the url scheme, but it should
> never be possible that if os.stat() says the file size is something, and
> after reading the file, we have a different number of bytes (if so, we have
> a much more serious problem with the worker), so I don't think this is an
> edge case we need to be particularly concerned with - we probably don't need
> to validate file size for file urls at all, but I left the code as it is, to
> avoid increasing the delta of this change.

This makes sense. Otherwise we could declare error_msg = "The retrieved Content-Length ..." as the default, override it in the if block, and call error_msg.format(...) on error.
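Something along these lines, say (a sketch with hypothetical names - parsed_url, expected_file_size, obtained_file_size - not the actual patch):

class ContentLengthMismatch(Exception):
    pass

def check_retrieved_size(parsed_url, expected_file_size, obtained_file_size):
    # Default message: the expected size came from a Content-Length header.
    error_msg = (
        'The retrieved Content-Length header declares a body length '
        'of {} bytes, while we actually retrieved {} bytes'
    )
    if parsed_url.scheme == 'file':
        # file:// urls have no Content-Length header; the expected size
        # comes from os.stat() instead.
        error_msg = (
            'os.stat() reports a file size of {} bytes, while we '
            'actually read {} bytes'
        )
    if obtained_file_size != expected_file_size:
        raise ContentLengthMismatch(
            error_msg.format(expected_file_size, obtained_file_size)
        )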
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> Also note ... I couldn't find a trivial way to gunzip the content without
> either reading it all into memory, or downloading it as a temp file, so I
> went for the temp file option. If you know a simple way to gunzip on the fly
> without reading everything into memory, I'm all ears. From what I could see,
> in order to decompress, the compression library needs to seek through the
> data, and of course you can't easily seek back and forth on an http response
> body, hence the reason to write to a compressed file first, and then
> decompress that. If you know of something though, and have a working
> example, let me know.

Mozharness has a long history of temp files, especially since we usually want actions to be standalone and not always need to hit the network. So a `download` action might save a temp file in work_dir, and a `do_something_with_download` action would then reference the temp file. The latter action could be run standalone if someone put the file in the expected place before running. Basically, I'm not worried about using a temp file here.
Thanks Aki!
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/35ef9cd54519
Support Content-Encoding header in mozharness when downloading from a url, r=aki
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
(In reply to Greg Arndt [:garndt] from comment #14)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=ec7ce9a1dd1a787c4d82d820809daf08cc6a3bd3

You've submitted a task against a worker type "gecko-t-win10-64-gpu-beta" which does not exist, so the queue has rejected the submission, and your decision task has failed. We only have a beta worker type for gecko-t-win10-64, not for gecko-t-win10-64-gpu. This is unrelated to this bug.
Blocks: 1358493
Backed out for causing bug 1358493.
Status: RESOLVED → REOPENED
Flags: needinfo?(pmoore)
Resolution: FIXED → ---
I'm so sorry.

Locally I have python 2.7.13, and this is working - do you know what version is used by the nightlies?

On 2.7.13, the syntax is accepted:

$ python
Python 2.7.13 (default, Dec 17 2016, 23:03:43)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> import shutil
>>> file_name='abc'
>>> with gzip.open(file_name + '.gz', 'rb') as f_in, open(file_name, 'wb') as f_out:
...     shutil.copyfileobj(f_in, f_out)
...
>>> quit()
Flags: needinfo?(pmoore)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/c8198aa6e767
Backed out changeset 35ef9cd54519 for breaking android nightlys
> (In reply to Greg Arndt [:garndt] from comment #14)
> > https://treeherder.mozilla.org/#/jobs?repo=try&revision=ec7ce9a1dd1a787c4d82d820809daf08cc6a3bd3
>
> You've submitted a task against a worker type "gecko-t-win10-64-gpu-beta"
> which does not exist, so the queue has rejected the submission, and your
> decision task has failed. We only have a beta worker type for
> gecko-t-win10-64, not for gecko-t-win10-64-gpu. This is unrelated to this
> bug.

Sorry, I didn't intentionally comment on this bug, and I have already fixed the decision task stuff elsewhere (you are right about the wrong worker type). I think the try link got posted here because one of the commits in my push contained this bug number.
(In reply to Greg Arndt [:garndt] from comment #20)
> Sorry, I didn't intentionally comment on this bug, and I have already fixed
> the decision task stuff elsewhere (you are right about the wrong worker
> type). I think because one of the commits in my push contained a bug number
> that it updated this bug with the try link.

Ahhhhhh, sorry - I thought you posted a link, and I thought it was strange you didn't put a comment! :) Now it all makes sense!
(In reply to Pete Moore [:pmoore][:pete] from comment #18)
> I'm so sorry.
>
> Locally I have python 2.7.13, and this is working, do you know what version
> is used by the nightlies?
>
> On 2.7.13, syntax is accepted:
>
> $ python
> Python 2.7.13 (default, Dec 17 2016, 23:03:43)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import gzip
> >>> import shutil
> >>> file_name='abc'
> >>> with gzip.open(file_name + '.gz', 'rb') as f_in, open(file_name, 'wb') as f_out:
> ...     shutil.copyfileobj(f_in, f_out)
> ...
> >>> quit()

So it looks like it must be python 2.6 (or earlier), since python 2.7 was the first release to support with statements containing multiple context managers. So I'm updating (and testing) the patch to be 2.6 compatible.
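A 2.6-compatible version of that block might look like this sketch (file_name as in the transcript above; the file objects are closed explicitly, since GzipFile only became usable as a context manager in python 2.7):

import gzip
import shutil

file_name = 'abc'

# python 2.6: GzipFile has no __enter__/__exit__, and a single with
# statement can't take two context managers, so close both explicitly.
f_in = gzip.open(file_name + '.gz', 'rb')
try:
    f_out = open(file_name, 'wb')
    try:
        shutil.copyfileobj(f_in, f_out)
    finally:
        f_out.close()
finally:
    f_in.close()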
Attached patch bug1358142_gecko_v2.patch (obsolete) — Splinter Review
This is the same patch as before, but with the following two changes:

1) It is now python 2.6 compatible (gzip.open(...) in python 2.6 cannot be used within a with statement).
2) I've set an extra window bit on the in-memory zlib decompression, to ensure that gzip decompression is enabled in the zlib.decompress call.

I've tested both statement blocks in isolation under python 2.6 and python 2.7. The following try push is also in progress (just linux64 opt builds+tests, as a random selection):

* https://treeherder.mozilla.org/#/jobs?repo=try&revision=127a79dfd40fca2dd5f935ddf36c34989a280ecd

Thanks Aki!
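(For reference, the in-memory zlib change presumably amounts to something like this sketch - compressed_body and the file name are hypothetical:)

import zlib

# A hypothetical gzip-compressed byte string, e.g. the raw bytes of an
# http response body.
compressed_body = open('abc.gz', 'rb').read()

# Passing 16 + MAX_WBITS as the window size tells zlib to expect a gzip
# header and trailer rather than a raw zlib stream.
body = zlib.decompress(compressed_body, 16 + zlib.MAX_WBITS)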
Attachment #8860050 - Attachment is obsolete: true
Attachment #8860476 - Flags: review?(aki)
Ah, I just spotted a doubled closing bracket mistake from copy/pasting - will fix and reattach - sorry!
Attachment #8860476 - Flags: review?(aki) → review+
Comment on attachment 8860478 [details] [diff] [review]
bug1358142_gecko_v3.patch

Good catch :)
Attachment #8860478 - Flags: review?(aki) → review+
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b255dbe5bc86
Support Content-Encoding header in mozharness when downloading from a url, r=aki
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED