Closed Bug 1347582 Opened 3 years ago Closed 2 years ago

Docker image hashes are not totally stable

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set

Tracking

(Not tracked)

RESOLVED FIXED
mozilla56

People

(Reporter: dustin, Assigned: swapneshks, Mentored)

Details

(Keywords: good-first-bug)

Attachments

(5 files, 3 obsolete files)

In taskcluster/taskgraph/util/docker.py, we create a tarball and take its hash to determine whether the docker image source files have changed.

This seems to work in in-tree builds: the same source code consistently generates the same hash.  For example,
  https://treeherder.mozilla.org/logviewer.html#?job_id=83974842&repo=mozilla-central&lineNumber=603
found hash 6e990d40c40d5e458814fd7768041c3572d35e0d77c72c294c4f9071fa21948d for index-task, and that already existed.

However, when running with the same parameters, on the same revision, I get

df31fd1038c245177667ae0a95f6a8e092eedd6503ab99db7f9aa6bbd8666496

So the challenge here is to figure out what differs here.

At a guess, it has to do with Python version; my development machine has "Python 2.7.11+" (Fedora).  The decision image has "Python 2.7.12" (Ubuntu 16.04).  But it may be something else!

A good initial debugging technique would be to get a copy of the tarball from an in-tree run (by modifying the source to write out the tarball and pushing to try) and locally, and then start comparing the resulting tarballs.
This also means that ./mach taskcluster-load-image doesn't work.

Perhaps unrelated, downloading a `image.tar.zst` file directly and trying to uncompress leads to

 $ zstd -d < image.tar.zst  > image.tar
zstd: stdin: not in zstd format
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> Perhaps unrelated, downloading a `image.tar.zst` file directly and trying to
> uncompress leads to
> 
>  $ zstd -d < image.tar.zst  > image.tar
> zstd: stdin: not in zstd format

Apparently that's because we require a version of zstd newer than that available in Ubuntu :)
Swapnesh, do you want to take a try at this?
Flags: needinfo?(swapneshks)
Yeah, sure Dustin. I'll have a look at this bug. :)
Flags: needinfo?(swapneshks)
Hi Dustin,

I am having some doubts while trying to run TaskCluster tasks:

1. For running locally, I pressed the "Run Locally" button in https://tools.taskcluster.net/task-inspector/#EyR0oUUkQj2bn4FDT8quSQ/ which showed a showed some commands. I copied them into a fresh local bash script and tried running the script. But it gets stuck at a root prompt:

> ~/Repos/Mozilla/src/mozilla-central$ ./taskcluster_EyR0oUUkQj2bn4FDT8quSQ.sh 
> sha256:0f59f922d86c471e208b7ea08ab077fc68c3920ed5e6895d69a23e8f3457dc24: Pulling from taskcluster/decision
> Digest: sha256:0f59f922d86c471e208b7ea08ab077fc68c3920ed5e6895d69a23e8f3457dc24
> Status: Image is up to date for taskcluster/decision@sha256:0f59f922d86c471e208b7ea08ab077fc68c3920ed5e6895d69a23e8f3457dc24
> root@fe22e8070134:/#

2. How to push a TaskCluster task to try after modifying the source? Any wiki/docs from where I can refer the steps?

Also, is my approach in 1. correct?
Flags: needinfo?(dustin)
The run-locally command seems to have an extra newline here:

  ${image_name} \

  /home/worker/bin/run-task ..

Try removing that newline?  That might be a good bug to fix, too (https://github.com/taskcluster/taskcluster-tools)

I think the best docs for try are https://wiki.mozilla.org/ReleaseEngineering/TryServer
Flags: needinfo?(dustin)
Looks like there are a couple of typos:

1. The extra newline you pointed in comment#6

2. Extra newline in -

/home/worker/bin/run-task '--vcs-checkout=/home/worker/checkouts/gecko' -- bash -cx 'cd /home/worker/checkouts/gecko && ln -s /home/worker/artifacts artifacts && ./mach --log-no-times taskgraph decision --pushlog-id=\'-1\' --pushdate=\'0\' --project=\'mozilla-central\' --message="no push -- cron task \'nightly-android\'" --owner=\'nobody@mozilla.org\' --level=\'3\' --base-repository=\'https://hg.mozilla.org/mozilla-central\' --head-repository=\'https://hg.mozilla.org/mozilla-central\' --head-ref=\'8dd496fd015a2b6e99573070279d9d1593836ea9\' --head-rev=\'8dd496fd015a2b6e99573070279d9d1593836ea9\' --triggered-by=nightly --target-tasks-method=nightly_fennec' \

3. In order to run the script, I had to change all "'" to """ in [2]. However, after doing that as well, I got this error:

/home/worker/bin/run-task: No such file or directory

I am not sure how to fix the error in [3].


Also, I wanted to let you know that I am having my report submission and presentation for my final year college project this Friday, so I will not be able to give enough time to this bug this week. I don't have any problem if someone is willing to work on this bug during that period.
Flags: needinfo?(dustin)
Those run-locally errors are brought up in bug 1366413, too.  In fact, "run locally" almost never works, since it doesn't support any of the interesting things that happen in TC automation. 

Instead of trying to run the decision task locally, I'd suggest making try pushes with print functions added to the decision-task code to print out what you need -- probably in taskcluster/taskgraph/docker.py and https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/util/docker.py to start with.  I'm guessing, but the issue might be that create_context_tar is generating files in a different order in the decision task and in a local development environment.

You can run an almost-no-op decision task with `try: -b o -j none -p linux64 -t none -u none`.

I really appreciate your letting me know about your time, thanks!  The big will be waiting for you when you return :)
Flags: needinfo?(dustin)
Hi Dustin,

Thanks for your consideration. Actually the college project work is done but some other important college work has popped up because of which I won't be able to give much time to moz bugs till next week. I'll surely work on this bug after next week if no one volunteers to work on this till then. :)
Attached patch docker_image_patch.diff (obsolete) — Splinter Review
Hi Dustin,

I have generated the following logs:

in-tree decision task log:
https://treeherder.mozilla.org/logviewer.html#?job_id=110390000&repo=try#L1012

local decision task log:
https://pastebin.mozilla.org/9025957

The lines to look for are those starting with "source_path: " and "fs_path: ".

Looks like you were right. The prints in logs are not in the same order.

What would be the best way to go about preserving the order? 
Maybe sort the archive_files list before calling create_tar_gz_from_files in http://searchfox.org/mozilla-central/source/taskcluster/taskgraph/util/docker.py#122 ?
Flags: needinfo?(dustin)
Hi Swapnesh -- sorry for the delay.

It looks like that's already sorted when creating the tar file:
  http://searchfox.org/mozilla-central/source/python/mozbuild/mozpack/archive.py#35
so I'm guessing something else is going awry?

If you run `tar -tzf` on the two tarballs, are the results in different order?
Flags: needinfo?(dustin)
Assignee: nobody → swapneshks
(In reply to Dustin J. Mitchell [:dustin] from comment #11)
> If you run `tar -tzf` on the two tarballs, are the results in different
> order?

Sorry, my laptop hard disk had failed which caused delay in replying.

I am able to get the local tarballs. 

For in-tree I'll be changing https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/util/docker.py#121
But I am having a doubt as to how will I get the copy of the tarball from in-tree run (from FTP or something else)?
Flags: needinfo?(dustin)
In-tree, you might add a bit of code to copy it to /home/worker/artifacts and then push to try.
Flags: needinfo?(dustin)
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> In-tree, you might add a bit of code to copy it to /home/worker/artifacts
> and then push to try.

Thanks. I was able to retrieve the tarballs from in-tree run.
On running 'tar -tzf' on one such local and in-tree tarball (for index-task), I see the files are listed in the same order.
Is this happening because my local machine has Python 2.7.12 -- Ubuntu 16.04.2 (same as decision image)?
Flags: needinfo?(dustin)
Interesting! It might be time to dive in with a hex editor or a binary-diff tool and see what the differences are.  I think the tarballs are small, right? If so, consider attaching them to this bug?
Flags: needinfo?(dustin)
Attached image index-task_diff_line0-272.png (obsolete) —
Attached image index-task_diff_line288-560.png (obsolete) —
You might want to un-gzip those first :)
This hexdiff is after running gunzip on the tarball.
The differences in the attached image are occurring in all files (just before the content of each file starts).
Individual files are not showing differences (during one-one comparison).
nicely done -- can you figure out what those fields mean?  0644 looks like it might be a permissions value.  Indeed:

dustin@hopper ~ $ tar -vtf tmp/intree.tar 
-rw-r--r-- 0/0              69 2015-12-31 19:00 index-task/.eslintrc.js
-rw-r--r-- 0/0             263 2015-12-31 19:00 index-task/Dockerfile
-rw-r--r-- 0/0             849 2015-12-31 19:00 index-task/README
-rw-r--r-- 0/0            1406 2015-12-31 19:00 index-task/insert-indexes.js
-rw-r--r-- 0/0           10482 2015-12-31 19:00 index-task/npm-shrinkwrap.json
-rw-r--r-- 0/0             130 2015-12-31 19:00 index-task/package.json
dustin@hopper ~ $ tar -vtf tmp/local.tar 
-rw-rw-r-- 0/0              69 2015-12-31 19:00 index-task/.eslintrc.js
-rw-rw-r-- 0/0             263 2015-12-31 19:00 index-task/Dockerfile
-rw-rw-r-- 0/0             849 2015-12-31 19:00 index-task/README
-rw-rw-r-- 0/0            1406 2015-12-31 19:00 index-task/insert-indexes.js
-rw-rw-r-- 0/0           10482 2015-12-31 19:00 index-task/npm-shrinkwrap.json
-rw-rw-r-- 0/0             130 2015-12-31 19:00 index-task/package.json

but what about the other bytes?
(In reply to Dustin J. Mitchell [:dustin] from comment #23)
> nicely done -- can you figure out what those fields mean?  0644 looks like
> it might be a permissions value.  

Yes, I confirmed that those are the permissions.

> but what about the other bytes?

The other (mismatch) bytes are the header checksum - https://www.mkssoftware.com/docs/man4/tar.4.asp 
Seems like this is in turn being caused due to the permissions field mismatch.

I guess this is the relevant code (for permissions) - http://searchfox.org/mozilla-central/source/python/mozbuild/mozpack/archive.py#36-41
Flags: needinfo?(dustin)
Indeed!  So do we need to add an option to that function to "force" a particular mode?
Flags: needinfo?(dustin)
(In reply to Dustin J. Mitchell [:dustin] from comment #25)
> Indeed!  So do we need to add an option to that function to "force" a
> particular mode?

Or maybe hard-code the permissions similar to http://searchfox.org/mozilla-central/source/python/mozbuild/mozpack/archive.py#55-58 ?
Will that affect something else?
I don't know what else uses mozpack, but I bet there's something, so it's probably best to only change behavior for this use -- hard coding would certainly affect other things, possibly causing issues.
Comment on attachment 8885761 [details]
Bug 1347582 - Use File object to obtain normalized file mode;

https://reviewboard.mozilla.org/r/156560/#review161626

This looks like the right fix, just some notes on the implementation below.  Have you confirmed that this consistently produces identical hashes?

::: python/mozbuild/mozpack/archive.py:21
(Diff revision 1)
>  
>  # 2016-01-01T00:00:00+0000
>  DEFAULT_MTIME = 1451606400
>  
>  
> -def create_tar_from_files(fp, files):
> +def create_tar_from_files(fp, files, forcemode=False):

This should be documented in the function's docstring (and for `create_tar_gz_from_files`).

Also, it might make more sense to allow the caller to specify the mode, with `None` being the default, so `create_tar_from_files(.., forcemode=0664)`.

::: python/mozbuild/mozpack/archive.py:55
(Diff revision 1)
>              if ti.mode & (stat.S_ISUID | stat.S_ISGID):
>                  raise ValueError('cannot add file with setuid or setgid set: '
>                                   '%s' % f)
>  
> +            # Force file mode (permissions) if 'forcemode' is True
> +            if forcemode == True

We usually don't want to compare to True.  Rather, just `if forcemode` would work here.

::: python/mozbuild/mozpack/archive.py:56
(Diff revision 1)
>                  raise ValueError('cannot add file with setuid or setgid set: '
>                                   '%s' % f)
>  
> +            # Force file mode (permissions) if 'forcemode' is True
> +            if forcemode == True
> +                ti.mode = 0664

We should probably stick to the in-tree permissions (0644) which are more restrictive than the local permissions (0664, allowing group write access).
Attachment #8885761 - Flags: review?(dustin) → review-
I check for index-task and hexdiff does not show any differences.
(In reply to Swapnesh Kumar Sahoo [:swapneshks] from comment #31)
> Created attachment 8885826 [details]
> index-task_gunzip_post-commit_diff_line0-272.png
> 
> I check for index-task and hexdiff does not show any differences.

The sha256sum for both the tars is also same.
Comment on attachment 8885761 [details]
Bug 1347582 - Use File object to obtain normalized file mode;

https://reviewboard.mozilla.org/r/156560/#review161742

Great work, thanks!

::: python/mozbuild/mozpack/archive.py:57
(Diff revision 2)
>              if ti.mode & (stat.S_ISUID | stat.S_ISGID):
>                  raise ValueError('cannot add file with setuid or setgid set: '
>                                   '%s' % f)
>  
> +            # Force file mode (permissions) if 'forcemode' is passed
> +            if forcemode is not None:

`is not None` is the right choice here because it means we could have forcemode = 0 (why someone would do that is another question).
Attachment #8885761 - Flags: review?(dustin) → review+
Pushed by dmitchell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ad2fdd0912ed
Add option to force file mode during tar creation; r=dustin
Backed out for breaking taskcluster jobs, or not scheduling them:

https://hg.mozilla.org/integration/autoland/rev/14214440f7476cd0e16c2343f1c4580a9cf064ad

Push with bustage: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=ad2fdd0912ed5c5f3b45c4bb0db929dc3bbf4e8d
Build log: https://treeherder.mozilla.org/logviewer.html#?job_id=113759687&repo=autoland

[taskcluster 2017-07-12 19:38:58.686Z] === Task Starting ===

[taskcluster:error] Failure to properly start execution environment.

[taskcluster:error] (HTTP code 500) server error - invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"exec: \\\"/home/worker/bin/run-task\\\": permission denied\"\n" 
[taskcluster 2017-07-12 19:38:59.133Z] === Task Finished ===
[taskcluster 2017-07-12 19:38:59.202Z] Artifact "public/build" not found at "/home/worker/artifacts/"
[taskcluster 2017-07-12 19:38:59.595Z] Unsuccessful task run with exit code: -1 completed in 105.835 seconds

For the next job, Linux and OS X taskcluster jobs don't get scheduled: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=813550f3f7137fa1d4ef1c833d7bf5ab42970ada

Please fix the issues and resubmit.
Flags: needinfo?(swapneshks)
In [1], 0644 is 'or'ed with f.mode -- f.mode may have more permissions (which might be desired). But we are forcing 0644. Same may be the case with [2]. Is this a possible reason for the permission denied error?

[1] http://searchfox.org/mozilla-central/source/python/mozbuild/mozpack/archive.py#38
[2] http://searchfox.org/mozilla-central/source/python/mozbuild/mozpack/archive.py#41
Flags: needinfo?(swapneshks) → needinfo?(dustin)
The `or` operator in Python is boolean, not bitwise (that would be `|`).  But I think you've got the right idea.

From the error, I think we have two permissions that matter: executable (0755) and non-executable (0644).  We want to avoid accidentally using 0775 or 0664 for either of those.  That's probably a little too specific for `create_tar_from_file` to implement, but that function can take BaseFile instances, so maybe we could provide it with the right information in the form of a BaseFile object.

Looking at the history for mozpack, I see that :gps is the majority author, so perhaps he has some suggestions as to how best to solve this issue.

To summarize, the issue is that developer systems tend to have permissions with u=g like 0664, while automation has 0644.  Just forcing everything to 0644 loses the executable bit, leading to the issues in comment 35.
Flags: needinfo?(dustin) → needinfo?(gps)
First, this patch is to the build system and needs a build peer review.

Regarding mode normalization, yes, we need to preserve executable bit. I think stripping the write bit from other and group fields as part of archiving makes sense. That should leave us with either 0755 for executable files or 0644 for non-executable files. This normalization should only occur in the non-BaseFile branch, as BaseFile instances are presumed to come from something more reliable than the filesystem and we may wish to explicitly set the write bit on the group or other fields.
Flags: needinfo?(gps)
Comment on attachment 8885761 [details]
Bug 1347582 - Use File object to obtain normalized file mode;

https://reviewboard.mozilla.org/r/156560/#review165474

::: python/mozbuild/mozpack/archive.py:40
(Diff revision 3)
>      """
>      with tarfile.open(name='', mode='w', fileobj=fp, dereference=True) as tf:
>          for archive_path, f in sorted(files.items()):
>              if isinstance(f, BaseFile):
>                  ti = tarfile.TarInfo(archive_path)
>                  ti.mode = f.mode or 0644

f.mode is of the format 0100xxx where xxx are the permissions. Should the 'or'ed 0644 be changed to 0100644 or is it intentional?
Comment on attachment 8885761 [details]
Bug 1347582 - Use File object to obtain normalized file mode;

I have verified that hexdiff for in-tree and local tars does not show any differences.

Try job that was used to generate in-tree tars:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8496f173b23e599abc7797f5563a4dc4d6b645ca&selectedJob=116680679

Also, whom should I add for build peer review before the patch lands?
Flags: needinfo?(dustin)
Comment on attachment 8885761 [details]
Bug 1347582 - Use File object to obtain normalized file mode;

https://reviewboard.mozilla.org/r/156560/#review165512

Regarding reviews: yes, absolutely needs a build peer review, and in fact :gps is the author of the whole file so he's probably a good person. Alternately, :glandium has reviewed all of those changes, so he could peek too.

One of those two can say for sure, but I wonder if it would make more sense to make the flag `vcs_modes` which limit the modes to *just* 0644 and 0755, based on the user-execute bit.  That's really what we're looking for here: the same behavior that git or hg have with respect to file permissions.

::: python/mozbuild/mozpack/archive.py:32
(Diff revision 3)
>      The files will be archived and written to the passed file handle opened
>      for writing.
>  
>      Only regular files can be written.
>  
> +    Mode (permissions) of files is forced if forcemode is True.

I think this should be modified a little, since it doesn't describe exactly what the code does anymore.  A more accurate description might be "If forcemode is true, group write permissions are stripped".

::: python/mozbuild/mozpack/archive.py:46
(Diff revision 3)
>                  ti.type = tarfile.REGTYPE
>              else:
>                  ti = tf.gettarinfo(f, archive_path)
> +                if ti.mode & 0000020 != 0:
> +                    # Force file mode (permissions) if 'forcemode' is True
> +                    if forcemode:

Bit-wise, there's no need to check the bit first:

    if forcemode:
        # strip the group-write permission bit
        ti.mode &= ~0020
Attachment #8885761 - Flags: review?(mh+mozilla)
Attachment #8885761 - Flags: review?(gps)
By the way, build peers are listed at https://wiki.mozilla.org/Modules/Core#Build_Config
Flags: needinfo?(dustin)
Comment on attachment 8885761 [details]
Bug 1347582 - Use File object to obtain normalized file mode;

https://reviewboard.mozilla.org/r/156560/#review166124

::: python/mozbuild/mozpack/archive.py:44
(Diff revision 3)
> +                if ti.mode & 0000020 != 0:
> +                    # Force file mode (permissions) if 'forcemode' is True
> +                    if forcemode:
> +                        ti.mode &= ~(0000020)

This is way too limited a cleanup. The funny thing is... if f were a File, mode would already be cleaned up, per File.mode calling BaseFile.normalize_path.

So it seems to me the easy fix here is to just make this function create File instances when f is not already a BaseFile.
Attachment #8885761 - Flags: review?(mh+mozilla)
Comment on attachment 8885761 [details]
Bug 1347582 - Use File object to obtain normalized file mode;

https://reviewboard.mozilla.org/r/156560/#review166488

One issue with the global use of `File` is that `File.mode` is `None` on Windows. So if this code runs on Windows, we'll lose the executable bit.

Of course, you can always fix that code in `mozpack.files` so the mode works on Windows.

::: python/mozbuild/mozpack/archive.py:41
(Diff revision 4)
> +                # Create a File object (to obtain normalized file mode)
> +                fi = File(f)
>                  ti = tf.gettarinfo(f, archive_path)
> +                ti.mode = fi.mode

The logic should be changed so the first thing this loop does is `if not isinstance(f, BaseFile)` and then creates a `File` from `f` and reassigns to `f`. Then all the `isinstance(f, Basefile)` checks and non-true branches can be removed.
(In reply to Gregory Szorc [:gps] from comment #47)
> Comment on attachment 8885761 [details]
> Bug 1347582 - Use File object to obtain normalized file mode;
> 
> https://reviewboard.mozilla.org/r/156560/#review166488
> 
> One issue with the global use of `File` is that `File.mode` is `None` on
> Windows. So if this code runs on Windows, we'll lose the executable bit.

There's no executable bit to pick up in the first place, on Windows.
Attachment #8881866 - Attachment is obsolete: true
Attachment #8885301 - Attachment is obsolete: true
Attachment #8885302 - Attachment is obsolete: true
Comment on attachment 8885761 [details]
Bug 1347582 - Use File object to obtain normalized file mode;

https://reviewboard.mozilla.org/r/156560/#review167106

This seems reasonable!
Attachment #8885761 - Flags: review?(gps) → review+
Comment on attachment 8885761 [details]
Bug 1347582 - Use File object to obtain normalized file mode;

https://reviewboard.mozilla.org/r/156560/#review167108

I can't trigger autoland on this because it has open issues. Please drop or resolve open issues and then it can be landed.

(Yes, it is silly that the review tool won't allow me to close open issues.)
(In reply to Swapnesh Kumar Sahoo [:swapneshks] from comment #40)
> Comment on attachment 8885761 [details]
> Bug 1347582 - Use File object to obtain normalized file mode;
> 
> https://reviewboard.mozilla.org/r/156560/#review165474
> 
> ::: python/mozbuild/mozpack/archive.py:40
> (Diff revision 3)
> >      """
> >      with tarfile.open(name='', mode='w', fileobj=fp, dereference=True) as tf:
> >          for archive_path, f in sorted(files.items()):
> >              if isinstance(f, BaseFile):
> >                  ti = tarfile.TarInfo(archive_path)
> >                  ti.mode = f.mode or 0644
> 
> f.mode is of the format 0100xxx where xxx are the permissions. Should the
> 'or'ed 0644 be changed to 0100644 or is it intentional?

Before I mark this as dropped, any comments on this?
Flags: needinfo?(gps)
python's tarfile doesn't use the high bits of the mode.
Flags: needinfo?(gps)
Thanks! All open issues have been marked as fixed/dropped.
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2c7389aef3bc
Use File object to obtain normalized file mode; r=dustin,gps
https://hg.mozilla.org/mozilla-central/rev/2c7389aef3bc
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Attachment #8885761 - Flags: review?(mh+mozilla)
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.