Closed Bug 1349261 Opened 3 years ago Closed 2 years ago

./mach taskcluster-load-image fails if zstd is not new enough

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: ydidwania, Mentored)

References

Details

(Keywords: good-first-bug)

Ubuntu ships a version of zstd which is too old to open our images.

The result is that the mach command hangs -- it appears not to catch the exit status from zstd.
Please don't use zstd < 1.0 in production. The data format is incompatible with 1.0+.

If you have data produced by <1.0, it is possible to compile zstd with legacy data format support. python-zstandard supports this.
Also, the latest version of python-zstandard supports multi-threaded compression (which isn't enabled by default when you compile zstandard so it probably isn't in any distro packages). This should enable faster compression which should reduce end-to-end times. Just switch to python-zstandard already ;)
We're not using it in production. The issue is, unless you have compiled it from source, it's probably what you have installed on your desktop.

Do you want to switch taskcluster-load-image to use python-zstandard?  That sounds like a nice fix..
I would like to see things switched to python-zstandard for the multi-threaded compression, yes. Do I want to do the work, not exactly: I have to dig myself out of 1 month of other backlog first!

Also, python-zstandard is only really beneficial on the compression side, as decompression doesn't use multiple threads. Also, if you switch to multi-threaded compression, you could crank up the compression level and burn more CPU time for smaller images, which should help I/O latency. You may want to read http://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/
Sorry, yes, I was just looking for approval, not doing the work :)
Mentor: gps
Depends on: 1350447
Hi, I am new to TaskCluster. I would like to work on this bug. Could you provide some more information about the issue?
Flags: needinfo?(dustin)
Hi!

Before I get started: I'm not sure what your level of familiarity is, so please ask questions.  You can stop by the #tc-contributors channel in IRC to get a quicker response.  There's some getting-started docs at https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Introduction and in particular you'll want to be able to get the source checked out and run mach (https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/mach#Running) in order to work on this bug.  You don't need to build Firefox.

You can see the issue pretty easily:

$ ./mach taskcluster-load-image desktop-build
######################################################################## 100.0%zstd: stdin: not in zstd format 

that command is *supposed* to go out and find the latest task that built the desktop-build image, and then download it, decompress it with zstd, and load it into docker.  Most of that works just fine, but the version of `zstd` installed on my laptop (and probably yours too) is older than 1.0, and doesn't understand the 1.0 format used to compress this data.

If you want to see it work correctly, you might do what I mentioned in comment 3 and build a copy of zstd from source, and put it somewhere in PATH.  Then the command above should work fine.

But to fix this issue for all developers, we'll need to take another approach.  Have a look at the load_image function in taskcluster/taskgraph/docker.py, and see what you think might be a good solution.  Greg's comments above suggest that there's a Python library (python-zstandard) which might be able to do the decompression.
Flags: needinfo?(dustin)
Assignee: nobody → didwaniayashvardhan
So maybe:
  self._activate_virtualenv()
  self.virtualenv_manager.install_pip_package('python-zstandard...')
in mach_commands.py

and then use python-zstandard to decompress, that way python-zstandard will be installed on demand, which is okay as the command is only used locally...
Sorry for taking so long.
I ran the command with both zstd v0.8.1 I had installed on my laptop and v1.3.0.2 built locally.

For v0.8.1
progress bar goes to 100 and then comes back to 0 again and never terminates

For v1.3.0.2

$ ./mach taskcluster-load-image desktop-build
######################################################################## 100.0%     

Here also the process never terminates. I kept it running for at least 5 hours.  

python-zstandard 0.8.1 is latest. How would taskcluster/taskgraph/docker.py need to be changed to accomodate this change
Flags: needinfo?(dustin)
I think the version numbers for python-zstandard are different than those for zstd, although the coincidence of version numbers does give me pause.  There appears to be a `python-zstd` which goes up to 1.3.0.2.  And the latest zstd (not python) is 1.3.1 [1]. From greg's notes above, we want to use python-zstandard, not python-zstd.  It's quite a mess -- hopefully the folks behind zstd will clear up some of this confusion eventually!

Regarding your experiment above, it's a good idea that you're trying to replicate the issue!  Which one did you build from source?  From the version numbers above it sounds like you installed python-zstd, which is not what the mach command uses -- so it makes sense that you didn't see much difference.  I suspect you could get the download to work by installing zstd from [1].

[1] https://github.com/facebook/zstd/releases
Flags: needinfo?(dustin)
I had built v1.3.0.2 locally.
Difference between the two versions was that in v0.8.1 the progress bar resets to something like

$ ./mach taskcluster-load-image desktop-build
                                                                           0.0%

after reaching 100%. This does not happen in v1.3.0.2 where it goes to 100% in a couple of seconds and stay.

Anyways, using [1] above worked although I faced a few problems setting it up. It took a few minutes to reach 100%.   

$ sudo ./mach taskcluster-load-image desktop-build
[sudo] password for jukebox: 
######################################################################## 100.0%
edda198f80e9: Loading layer    202MB/202MB
e4fd29158e1f: Loading layer  1.024kB/1.024kB
d1122239dfcf: Loading layer  1.024kB/1.024kB
.
.
Found docker image: desktop-build:42fc524d1f4027c901a6ea2713cfb7ed4eb0088392097cd2880c479f8c5f123e
Try: docker run -ti --rm desktop-build:42fc524d1f4027c901a6ea2713cfb7ed4eb0088392097cd2880c479f8c5f123e bash

$ sudo docker run -ti --rm desktop-build:42fc524d1f4027c901a6ea2713cfb7ed4eb0088392097cd2880c479f8c5f123e bash
[root@447bcfb614ae ~]# exit
exit

That is what we are trying to achieve from this bug right?
From what I understand, these images are an alternative to a virtual machine but it has only the required depenndencies, not another OS(like vagrant?) . Please correct me if I am wrong
Flags: needinfo?(dustin)
Yes, that is what we are trying to achieve -- without having to install zstd, and using python-zstandard instead.  Python-zstd is not relevant at all.

In general, that's what Docker images are.  For our purposes specifically, the images are only available as Docker images so there is no alternative.
Flags: needinfo?(dustin)
I tried installing python-zstandard and getting it to work. However I am getting an ImportError

$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import zstandard
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named zstandard
>>>

I tried all of the possible solutions mentioned in the [official documentation](https://github.com/indygreg/python-zstandard).
Even tried to build the extension locally. Can't seem to get it working(Need Help). 

Once I can I think the fix should be modifying the taskcluster/taskgraph/docker.py to something like

dctx = zstd.ZstdDecompressor()
decompressed = dctx.decompress(curl.stdout)
tarin = tarfile.open(mode='r|', fileobj=zstd.stdout)

This will work provided the content size is encoded in the header or we can specify a max_output_size in dctx.decompress()
Flags: needinfo?(dustin)
I received the same error after installing and trying to import.  Greg, do you have any suggestions?
Flags: needinfo?(gps)
Flags: needinfo?(dustin)
dustin@jemison ~ $ tmp/ve/bin/pip install zstandard
Collecting zstandard
  Downloading http://localhost:3141/root/pypi/+f/f97/ba5cb2929f704/zstandard-0.8.1.tar.gz (463kB)
    100% |████████████████████████████████| 471kB 761kB/s 
Building wheels for collected packages: zstandard
  Running setup.py bdist_wheel for zstandard ... done
  Stored in directory: /home/dustin/.cache/pip/wheels/a7/cc/a9/31faa9d7baa2cdc1a3165d40601b1187dee85625be397614a6
Successfully built zstandard
Installing collected packages: zstandard
Successfully installed zstandard-0.8.1
dustin@jemison ~ $ tmp/ve/bin/python
Python 2.7.13 (default, May 10 2017, 20:04:28) 
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import zstandard
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named zstandard
>>> import zstd

So.. repo is "python-zstandard", pip package is "zstandard", and the module name is "zstd".  And the README indicates the module is named "zstandard".  That's pretty confusing!!

Yes, I think your proposed approach is the right one.
>>>import zstd 

does not import the python-zstandard library.It instead imports the zstd(not python)v0.8.1 (or whichever is installed on your laptop)which we have to migrate from.

In my approach I forgot to mention,

import zstandard as zstd

Sorry for the inconvenience.
You can't import something that isn't Python.  But just to check what `import zstd` gets:

>>> print zstd.__file__
/home/dustin/tmp/ve/lib/python2.7/site-packages/zstd.so

which is freshly installed in my virtualenv today:

dustin@jemison ~ $ ls -al /home/dustin/tmp/ve/lib/python2.7/site-packages/zstd.so
-rwxrwxr-x. 1 dustin dustin 2037416 Aug 29 16:33 /home/dustin/tmp/ve/lib/python2.7/site-packages/zstd.so

I full agree this is massively confusing.  Hopefully :gps can straighten us out a little.
There is another zstd python library which also provides the python bindings .

https://pypi.python.org/pypi/zstd

Yes. Lets wait for :gps to clear out things.
Do not use https://pypi.python.org/pypi/zstd. It contains support for a legacy, pre-release version of the zstandard archive format. We don't want archives in that format (silently) being used because it is a deprecated format. The zstandard package on PyPI (which I'm the author of) doesn't support these legacy formats (unless you pass an argument to setup.py and do a custom build). The zstandard PyPI package is also superior in terms of features and performance. It should be used. (I don't think "zstd" even provides streaming decompression, which for Docker images means you may need to allocate gigabytes of memory to perform compression and decompression, which will likely yield out of memory errors.)

Also, the "zstandard" PyPI package uses the "zstd" package until the unreleased 0.9 version. The Python package name was changed to "zstandard" in 0.9 to avoid conflicts with the "zstd" PyPI package. The README on PyPI correctly states the old name. I think you are confused because you are reading the README on GitHub, which is tracking the master branch and therefore the unreleased 0.9 release. Sadly, PyPI doesn't format the ReST for zstandard because it is too large. Derp. Use https://github.com/indygreg/python-zstandard/blob/b3a8f1a7f5b3a791a1d9fed372d2b0d63fb66290/README.rst for zstandard==0.8.1.
Flags: needinfo?(gps)
Okay. Thanks for the clarification.
I don't seem to understand though, how do we then import the zstandard(0.8.1 latest released) package since the name was changed only in the 0.9 version?
The Readme on PyPI gives the name as "zstandard" but 
>>>import zstandard  
gives ImportError
Flags: needinfo?(gps)
  $ pip install zstandard==0.8.1
  $ python
  >>> import zstd

The PyPI package name does not have any relation to the actual Python package name (although they are commonly the same for good reason).
Flags: needinfo?(gps)
Look at https://github.com/indygreg/python-zstandard/tree/0.8.1 for the 0.8.1 docs.

But, does it make sense to implement this now when everything is going to change incompatibly soon?  Should we just leave this until the zstandard universe has stabilized/
(In reply to Gregory Szorc [:gps] from comment #21)
>   $ pip install zstandard==0.8.1
>   $ python
>   >>> import zstd
> 
> The PyPI package name does not have any relation to the actual Python
> package name (although they are commonly the same for good reason).

Yeah. I figured it out just now that zstandard and zstd were shared libraries and they had the common zstd.so file
(In reply to Dustin J. Mitchell [:dustin] from comment #22)
> Look at https://github.com/indygreg/python-zstandard/tree/0.8.1 for the
> 0.8.1 docs.
> 
> But, does it make sense to implement this now when everything is going to
> change incompatibly soon?  Should we just leave this until the zstandard
> universe has stabilized/

If you write `import zstd as zstandard` or `import zstandard as zstd`, you only need to change one line when you upgrade the Python package version. Of course, if you fail to pin the package version, you get hit by nastiness. This is why packages should be pinned and why version numbers exist :)
Ah, OK, I interpreted "Also, the "zstandard" PyPI package uses the "zstd" package until the unreleased 0.9 version" as meaning that before 0.9, zstandard only supports the old formats ("It contains support for a legacy, pre-release version of the zstandard archive format").  OK, proceed!
I made the following changes:

curl = Popen(['curl', '-#', '--fail', '-L', '--retry', '8', url], stdout=PIPE).communicate()
dctx = zstd.ZstdDecompressor()
decompressed = dctx.decompress(curl.stdout)
tarin = tarfile.open(mode='r|', fileobj=decompressed)

I had to add the ".communicate()" to make it a blocking call .Few questions-
1) Is the above the correct way to create a .tar file with the decompressed data?
2) Is the size of the image encoded in its header as required by the Simple API?


Initially I got an AttributeError for 
`raise error` 
(line 201. I am sorry I donot have the terminal log)

On trying few more things, I am now getting

$ ./mach taskcluster-load-image desktop-build
######################################################################## 100.0%
Traceback (most recent call last):
  File "/home/jukebox/mozilla-central/taskcluster/mach_commands.py", line 482, in load_image
    ok = load_image_by_name(image_name, tag)
  File "/home/jukebox/mozilla-central/taskcluster/taskgraph/docker.py", line 37, in load_image_by_name
    return load_image_by_task_id(task_id, tag)
  File "/home/jukebox/mozilla-central/taskcluster/taskgraph/docker.py", line 42, in load_image_by_task_id
    result = load_image(artifact_url, tag)
  File "/home/jukebox/mozilla-central/taskcluster/taskgraph/docker.py", line 191, in load_image
    raise Exception('failed to download from url: {}'.format(url))
Exception: failed to download from url: https://queue.taskcluster.net/v1/task/S9Cvu_uoS4qqbZgOhbe38Q/artifacts/public/image.tar.zst


I guess I am not being allowed to download as I may have run it more than the permitted number of times.
I was planning to use the streaming APIs once I was able to get it working with the simple APIs

Doc - https://github.com/indygreg/python-zstandard/tree/0.8.1
Flags: needinfo?(gps)
Flags: needinfo?(dustin)
You should be able to download that as many times as you want.  It's interesting that `curl` didn't give you an error there - that message should only occur when curl has failed, and we assume that curl printed some error message to stderr.

Regarding adding .communicate -- that will make the first line (Popen([..]).communicate()) block until curl has downloaded the entire file and written the output to the pipe.  However, you don't read from that pipe until two lines later.  So unless the pipe can hold a few GB (it can't!) this will end up deadlocking as curl waits for space in the pipe, and the Python script waits for curl to finish.

I don't think you need the .communicate().  Without it, curl should run in parallel to the Python script, and dctx.decompress should pull the data from curl's pipe as it is available (then hand the decompressed result to tarfile).  You will need to call curl.wait() when everything is done to clean up and get its exit status, though.

I don't know the answer to (2).
Flags: needinfo?(dustin)
Dustin answered for me.

I'll add that I'm not sure why curl is being used. We should be able to `import requests` and use it to obtain a Python stream of the HTTP response body. No process pipes will come into play.
Flags: needinfo?(gps)
Tried to get it working with curl first.


curl = Popen(['curl', '-#', '--fail', '-L', '--retry', '8', url], stdout=PIPE)
dctx=zstd.ZstdDecompressor()

with dctx.write_to(fh) as decompressor:
    decompressor.write(curl.stdout)

tarin = tarfile.open(mode='r|', fileobj=fh)


I get the following result
        
$ ./mach taskcluster-load-image desktop-build
######################################################################## 100.0%^Z
[4]+  Stopped                 ./mach taskcluster-load-image desktop-build
jukebox@ThinkPad-E470:~/mozilla-central$

Loads to 100% in around 5s and never terminates.
decompressor.write(curl.stdout) is called whenever PIPE is  non-empty.

I guess the issue is because decompressor.write() keeps expecting to pull from PIPE even when curl is finished downloading and PIPE is empty. I want to stream the decompressed data from fh and write it to tarin. However, the Streaming Output API in zstandard streams the decompressed data as an iterator of data chunks. How do we write these chunks to the fileobj in tarfile?
Flags: needinfo?(gps)
The API you are asking for is explicitly asked for in https://github.com/indygreg/python-zstandard/issues/23. Instructions on how to polyfill it are in https://github.com/indygreg/python-zstandard/issues/13.

I should piece together a new release of python-zstandard with this API. I've been busy with so many other things.
Flags: needinfo?(gps)
I tried the polyfill given in https://github.com/indygreg/python-zstandard/issues/13.

curl = Popen(['curl', '-#', '--fail', '-L', '--retry', '8', url], stdout=PIPE)
dctx=zstd.ZstdDecompressor()
bytesio = BytesIO(b''.join(dctx.read_from(curl.stdout)))
decompressed=bytesio.readline.decode()
tarin = tarfile.open(mode='r|', fileobj=decompressed)

raises MemoryError 

We are trying to store the decompressed file( about 8gb ) in the BytesIO buffer which I think is causing the problem.
Am I right in thinking we need a streaming input along with a streaming output to and from the decompressor to ensure that the PIPE is not full as well as the decompressed data is sent to tarfile incrementally and not stored anywhere in between .
Flags: needinfo?(gps)
As you discovered, we can't buffer everything in memory for these Docker images because they are too large. Everything needs to be streamed end-to-end.

I did hack on python-zstandard a little over the weekend. I think I could get a new release out the door with minimal effort. The code in master does have a new "stream reader" API that should facilitate streaming decompression with tarfile. See https://github.com/indygreg/python-zstandard/blob/master/README.rst#stream-reader-api-1.

Could you try compiling the latest version from master and verify it meets your API needs? If so, I could try to get 0.9 out the door by end of the week.
Flags: needinfo?(gps)
I went through the docs of Stream Reader API a few minutes back. Looks like it should work . I need to figure out how to keep sending those chunks to tarfile . Will confirm by tomorrow . Thanks
We need to keep writing the chunks to tarin.fileobj. At the same time, read the members of tarin and write them to tarout. I tried doing it by using BytesIO() as the fileobj.

(1)  bytesio=BytesIO()        
     tarin = tarfile.open(fileobj=bytesio , mode='r|')

     Gives a ReadError : empty file 
     Hence I am not able to test this method.
     So we need to manage this somehow.

(2)  All chunks are written to bytesio. To prevent a MemoryError , I am updating the bytesio as
      
      bytesio.seek(stream_pos)
      bytesio=BytesIO(bytesio.read())
       
      stream_pos = sum of sizes of all handled members in tarin
      So, basically reinitialize the buffer so that used data is removed and bytesio.read() starts from the remainder of the chunk.

If we are able to get past (1) somehow, I think this approach should work.

Here's the code: https://pastebin.com/qzwkn8uM

Ofcourse, I have  to update the comments and variable names.
Sorry for taking longer :)
Flags: needinfo?(gps)
Does something like the following work?

   curl = ...
   dctx = zstandard.ZstdDecompressor()
   with dctx.stream_reader(curl.stdout) as dfh:
        tarin = tarfile.open(fileobj=dfh, mode='r|')
        for member in tarin:
            ...

This /should/ allow you to iterate over members of a zstd compressed tar file and then forward them elsewhere. You shouldn't need to manage any intermediate buffers or perform any seeks. The only thing I see getting in the way of this is if the tarfile API doesn't let you easily copy data between 2 non-seekable streams. But I /think/ it should.
Flags: needinfo?(gps)
It worked perfectly. Thanks
I had read up that fileobj can be any object with a read() or write() method and dfh had the read() method. 
I should have thought of this.

Anyways, how do I submit the patch now?
Flags: needinfo?(gps)
Hi Greg, you'd need to release version 0.9 before I can submit the patch. Let me know when you are done.
Flags: needinfo?(gps)
Blocks: 1405987
FWIW I started hacking on python-zstandard again. I'm aiming to get a 0.9 release out sometime this month.

I'll leave the needinfo set to remind myself to update this bug when I release something.
This appears to get fixed in https://reviewboard.mozilla.org/r/215576/diff/1#index_header, by pip-installing zstd if it's not new enough.  Yash, do you want to have a look at that patch?
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Sorry Dustin, I missed this. I hope its fine  to comment now.
I looked at the patch, the IteratorReader class is a good idea.
The function defined on line 379 , shouldn't it be `_ensure_zstd` instead of `_ensure_sztd` ?
Indeed, it's been spelled wrong twice!
Flags: needinfo?(gps)
Yes, the two mistakes ensured zstd was working fine:)
A case of two negatives making a positive !
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.