Near-perma "OSError: symbolic link privilege not held" in fetch-content
Categories
(Infrastructure & Operations :: RelOps: Hardware, defect, P2)
Tracking
(Not tracked)
People
(Reporter: sparky, Unassigned)
References
(Blocks 2 open bugs)
Details
This is for an intermittent bug that has been popping up in Browsertime tests (I can't find it anywhere else). Here's the error log:
[fetches 2020-02-19T00:16:28.468Z] inflating: ffmpeg-4.1.1-win64-static/README.txt
[fetches 2020-02-19T00:16:28.468Z] inflating: ffmpeg-4.1.1-win64-static/LICENSE.txt
[fetches 2020-02-19T00:16:28.469Z] creating: ffmpeg-4.1.1-win64-static/presets/
[fetches 2020-02-19T00:16:28.469Z] inflating: ffmpeg-4.1.1-win64-static/presets/libvpx-1080p50_60.ffpreset
[fetches 2020-02-19T00:16:28.470Z] inflating: ffmpeg-4.1.1-win64-static/presets/ffprobe.xsd
[fetches 2020-02-19T00:16:28.471Z] inflating: ffmpeg-4.1.1-win64-static/presets/libvpx-720p.ffpreset
[fetches 2020-02-19T00:16:28.471Z] inflating: ffmpeg-4.1.1-win64-static/presets/libvpx-360p.ffpreset
[fetches 2020-02-19T00:16:28.472Z] inflating: ffmpeg-4.1.1-win64-static/presets/libvpx-1080p.ffpreset
[fetches 2020-02-19T00:16:28.472Z] inflating: ffmpeg-4.1.1-win64-static/presets/libvpx-720p50_60.ffpreset
[fetches 2020-02-19T00:16:28.474Z] C:\Users\task_1582070133\fetches\ffmpeg-4.1.1-win64-static.zip extracted in 1.605s
[fetches 2020-02-19T00:16:28.474Z] Removing C:\Users\task_1582070133\fetches\ffmpeg-4.1.1-win64-static.zip
[fetches 2020-02-19T00:16:31.007Z] C:\Users\task_1582070133\fetches\node.tar.bz2 extracted in 5.397s
[fetches 2020-02-19T00:16:31.009Z] Removing C:\Users\task_1582070133\fetches\node.tar.bz2
[fetches 2020-02-19T00:16:31.011Z] Traceback (most recent call last):
[fetches 2020-02-19T00:16:31.011Z] File "C:\mozilla-build\python3\lib\tarfile.py", line 2210, in makelink
[fetches 2020-02-19T00:16:31.012Z] os.symlink(tarinfo.linkname, targetpath)
[fetches 2020-02-19T00:16:31.012Z] OSError: symbolic link privilege not held
[fetches 2020-02-19T00:16:31.012Z]
[fetches 2020-02-19T00:16:31.012Z] During handling of the above exception, another exception occurred:
[fetches 2020-02-19T00:16:31.012Z]
[fetches 2020-02-19T00:16:31.012Z] Traceback (most recent call last):
[fetches 2020-02-19T00:16:31.012Z] File "fetch-content", line 663, in <module>
[fetches 2020-02-19T00:16:31.013Z] sys.exit(main())
[fetches 2020-02-19T00:16:31.013Z] File "fetch-content", line 659, in main
[fetches 2020-02-19T00:16:31.013Z] return args.func(args)
[fetches 2020-02-19T00:16:31.013Z] File "fetch-content", line 606, in command_task_artifacts
[fetches 2020-02-19T00:16:31.013Z] fetch_urls(downloads)
[fetches 2020-02-19T00:16:31.013Z] File "fetch-content", line 482, in fetch_urls
[fetches 2020-02-19T00:16:31.013Z] f.result()
[fetches 2020-02-19T00:16:31.013Z] File "C:\mozilla-build\python3\lib\concurrent\futures\_base.py", line 432, in result
[fetches 2020-02-19T00:16:31.014Z] return self.__get_result()
[fetches 2020-02-19T00:16:31.014Z] File "C:\mozilla-build\python3\lib\concurrent\futures\_base.py", line 384, in __get_result
[fetches 2020-02-19T00:16:31.014Z] raise self._exception
[fetches 2020-02-19T00:16:31.014Z] File "C:\mozilla-build\python3\lib\concurrent\futures\thread.py", line 56, in run
[fetches 2020-02-19T00:16:31.015Z] result = self.fn(*self.args, **self.kwargs)
[fetches 2020-02-19T00:16:31.015Z] File "fetch-content", line 468, in fetch_and_extract
[fetches 2020-02-19T00:16:31.015Z] extract_archive(dest_path, dest_dir, typ)
[fetches 2020-02-19T00:16:31.015Z] File "fetch-content", line 329, in extract_archive
[fetches 2020-02-19T00:16:31.016Z] tar.extractall(str(dest_dir))
[fetches 2020-02-19T00:16:31.016Z] File "C:\mozilla-build\python3\lib\tarfile.py", line 2008, in extractall
[fetches 2020-02-19T00:16:31.016Z] numeric_owner=numeric_owner)
[fetches 2020-02-19T00:16:31.016Z] File "C:\mozilla-build\python3\lib\tarfile.py", line 2050, in extract
[fetches 2020-02-19T00:16:31.018Z] numeric_owner=numeric_owner)
[fetches 2020-02-19T00:16:31.018Z] File "C:\mozilla-build\python3\lib\tarfile.py", line 2128, in _extract_member
[fetches 2020-02-19T00:16:31.018Z] self.makelink(tarinfo, targetpath)
[fetches 2020-02-19T00:16:31.019Z] File "C:\mozilla-build\python3\lib\tarfile.py", line 2221, in makelink
[fetches 2020-02-19T00:16:31.019Z] targetpath)
[fetches 2020-02-19T00:16:31.020Z] File "C:\mozilla-build\python3\lib\tarfile.py", line 2120, in _extract_member
[fetches 2020-02-19T00:16:31.020Z] self.makefile(tarinfo, targetpath)
[fetches 2020-02-19T00:16:31.020Z] File "C:\mozilla-build\python3\lib\tarfile.py", line 2159, in makefile
[fetches 2020-02-19T00:16:31.021Z] source.seek(tarinfo.offset_data)
[fetches 2020-02-19T00:16:31.021Z] File "C:\mozilla-build\python3\lib\tarfile.py", line 522, in seek
[fetches 2020-02-19T00:16:31.021Z] raise StreamError("seeking backwards is not allowed")
[fetches 2020-02-19T00:16:31.021Z] tarfile.StreamError: seeking backwards is not allowed
[fetches 2020-02-19T00:16:31.045Z] removing C:/Users/task_1582070133/fetches
[fetches 2020-02-19T00:16:32.221Z] finished
[taskcluster 2020-02-19T00:16:32.236Z] Exit Code: 1
The failure can be found here: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&group_state=expanded&tier=1%2C2%2C3&searchStr=browsertime&revision=df596657bebcb96b917d75ff452316bbe8140a1a&selectedJob=289436219
Comment 1•4 years ago
The actually relevant part of the log is "OSError: symbolic link privilege not held". If it's intermittent, that would tend to mean the workers are not consistently configured.
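For reference, the chained traceback above shows two separate problems: os.symlink() fails because the task user does not hold the Windows symlink privilege, and tarfile's built-in fallback (extracting the link target's data instead of creating a link) then fails with StreamError because the archive appears to be opened in streaming mode, which cannot seek back to an earlier member. A minimal sketch of that fallback path, assuming the tarball is a regular file on disk and the link targets are members of the same archive (this is not fetch-content's actual code):

import tarfile

def extract_seekable(archive_path, dest_dir):
    # Open in seekable mode ("r:*") rather than streaming mode ("r|*").
    # With a seekable archive, tarfile degrades a failed os.symlink() to a
    # copy of the link target's data instead of raising
    # StreamError("seeking backwards is not allowed").
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(dest_dir)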
Comment 2•4 years ago
https://firefox-ci-tc.services.mozilla.com/tasks/P1mSH2m1SrK7L1j5K341qQ ran on worker pool gecko-t-win10-64-hw
Reporter
Comment 3•4 years ago
I've updated the title to use the relevant part of the log.
Comment hidden (Intermittent Failures Robot)
Comment 5•4 years ago
I had one occurrence with this patch: it's happening during the source checkout there.
Dave, this is with the perftest patch that now works on talos-linux python38, which I am now trying to get working on Windows.
(In reply to Tarek Ziadé (:tarek) from comment #5)
> I had one occurrence with this patch: it's happening during the source checkout there.
> Dave, this is with the perftest patch that now works on talos-linux python38, which I am now trying to get working on Windows.
Is this a new task? I'm guessing this is the new task we are changing the linux64 configuration to run (bug 1647648). If so, then we'll need one of the Windows engineers and/or a loaner set up for dev work and figuring out the config changes.
Redirecting to :markco for debugging the windows-hw config for the new task(s).
Comment 7•3 years ago
"OSError: symbolic link privilege not held" leads me to believe there is a race condition affecting the privileges held by the task user. I will spin up a test pool in the later part of this week and see if I can recreate the error.
Comment 8•3 years ago
Comment 9•3 years ago
(In reply to Dave House [:dhouse] from comment #6)
> Is this a new task? I'm guessing this is the new task we are changing the linux64 configuration to run (bug 1647648). If so, then we'll need one of the Windows engineers and/or a loaner set up for dev work and figuring out the config changes.
Yeah.
> Redirecting to :markco for debugging the windows-hw config for the new task(s).
Thanks.
Reporter
Comment 10•3 years ago
On my latest browsertime desktop test, this is nearly a perma-fail: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=2b4f8e91252656f389a85f2be2c3a9183dca0f32
Comment 11•3 years ago
I am wondering whether it's a TC race issue; I am seeing some errors on macOS generic workers as well, which also happen while grabbing/decompressing the node tarball.
Comment 12•3 years ago
I think the problem comes from this thread pool executor: https://searchfox.org/mozilla-central/source/taskcluster/scripts/misc/fetch-content#475
On macOS at least, moving back to a synchronous loop fixed my bug. I'm trying to run the Windows version to see if it fixes it there.
Comment 13•3 years ago
Note that all the logs are mixed up because of that thread pool executor, which makes it hard to debug and understand where it breaks.
Adding a couple of people to get their opinions.
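One rough way to make the interleaved output attributable (the log() helper and the names below are illustrative, not fetch-content's actual logging):

import threading

_print_lock = threading.Lock()

def log(artifact, message):
    # Serialize writes and tag each line with the artifact being processed,
    # so output from concurrent downloads can be told apart.
    with _print_lock:
        print('[%s] %s' % (artifact, message), flush=True)

# e.g. inside fetch_and_extract(): log(dest_path.name, 'extracted in 1.6s')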
Comment 14•3 years ago
I'm not getting it anymore, so far.
Comment 15•3 years ago
I'm not familiar enough with fetch-content to say whether making it synchronous is acceptable or not. I assume the answer would be "no" and we'd want an alternate fix, but :tomprince or :glandium can probably answer better.
Comment 16•3 years ago
OK, thanks. I am trying to change the pool to processes instead of threads to see if it helps.
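A rough sketch of that experiment against fetch-content's fetch_urls() (CONCURRENCY, fetch_and_extract and downloads are the names from the script; the change itself is illustrative, not a tested patch):

import concurrent.futures

def fetch_urls(downloads):
    """Fetch URLs pairs to a pathlib.Path."""
    # Swap the thread pool for a process pool. Note that on Windows this
    # requires the submitted callable and its arguments to be picklable and
    # the script's entry point to be guarded by if __name__ == "__main__".
    with concurrent.futures.ProcessPoolExecutor(CONCURRENCY) as e:
        fs = [e.submit(fetch_and_extract, *download) for download in downloads]
        for f in fs:
            f.result()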
Comment 17•3 years ago
Comment 18•3 years ago
This is my workaround:
 def fetch_urls(downloads):
     """Fetch URLs pairs to a pathlib.Path."""
+    if 'MOZ_SYNC_FETCHES' in os.environ:
+        for download in downloads:
+            fetch_and_extract(*download)
+        return
     with concurrent.futures.ThreadPoolExecutor(CONCURRENCY) as e:
         fs = []
-
         for download in downloads:
             fs.append(e.submit(fetch_and_extract, *download))
-
         for f in fs:
             f.result()
I guess we could use MOZ_SYNC_FETCHES with caution on talos only?
Comment 19•3 years ago
(In reply to Tarek Ziadé (:tarek) from comment #17)
> nope https://firefoxci.taskcluster-artifacts.net/CvmCQW6gTM-zdrY9nMsfhw/0/public/logs/live_backing.log
This is not at all the same error as comment 0.
Comment 20•3 years ago
(In reply to Mike Hommey [:glandium] from comment #19)
> (In reply to Tarek Ziadé (:tarek) from comment #17)
> > nope https://firefoxci.taskcluster-artifacts.net/CvmCQW6gTM-zdrY9nMsfhw/0/public/logs/live_backing.log
> This is not at all the same error as comment 0.
We suspect it is of the same nature (race conditions).
Reporter
Comment 21•3 years ago
It seems like I am hitting this error while extracting the browsertime.tar.gz archive. Synchronous downloads/extracts as suggested in comment #18 don't solve this, but they do make it easier to debug.
One thing I'm wondering is whether the Windows permission required to create symlinks could be intermittently missing on our Windows machines. :dhouse, would you have some way to check this?
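One quick way to check (a rough sketch that could be run as the task user on a worker; the helper name is illustrative and this is not an existing fetch-content or worker check):

import os
import tempfile

def can_create_symlinks():
    # Try creating a symlink in a throwaway directory. On Windows, os.symlink
    # raises OSError ("symbolic link privilege not held") when the user lacks
    # SeCreateSymbolicLinkPrivilege and Developer Mode is not enabled.
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, 'target')
        link = os.path.join(d, 'link')
        open(target, 'w').close()
        try:
            os.symlink(target, link)
        except OSError:
            return False
        return True

print('symlink privilege held:', can_create_symlinks())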
Updated•3 years ago
Comment 22•3 years ago
+1 for redirect of NI to :grenade. I'm not sure how to check for issues with symlinks on Windows.
Comment 23•3 years ago
(In reply to Greg Mierzwinski [:sparky] from comment #21)
> One thing I'm wondering is whether the Windows permission required to create symlinks could be intermittently missing on our Windows machines. :dhouse, would you have some way to check this?
Yes, I'd say that's quite possible, probably likely. The fix (for cloud instances) won't be quick, as we are caught between image deployment mechanisms just now (another cloud migration). The symlink permission granting mechanism relies on an upstream PowerShell module (Carbon) which has moved from its currently configured source to GitHub, which is the likeliest reason for this bustage. Images that have Carbon baked in (everything in EC2) are probably fine, but hardware that has been recently reimaged will definitely have issues.
:markco: Dustin mentions gecko-t-win10-64-hw in comment 2; do you know if this worker type relies on the Bitbucket Carbon package for symlinks?
Comment 24•3 years ago
> :markco: Dustin mentions gecko-t-win10-64-hw in comment 2; do you know if this worker type relies on the Bitbucket Carbon package for symlinks?
It looks like there is a possible issue with how Puppet is calling the Bitbucket Carbon commands. I will take a look at it this morning.
Comment 25•3 years ago
I had to update the Carbon package; it was missing a few files. Within the next 30 minutes the gecko-t-win10-64-dev workers will have the updated packages. Could someone test on those workers to see if this resolves the problem?
Sorry that this fell off my radar. After comment 14 I thought this was no longer an issue.
Reporter
Comment 26•3 years ago
Thanks :markco! I'll start some browsertime tests shortly to see if it's still an issue.
Reporter
Comment 27•3 years ago
:markco it seems like there are still some machines that are broken: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=b266b48ee2bf8da618838b7c492c526fcac131c7&selectedTaskRun=A5Ck1sdQSAaKuNRceCN2bw.0
Comment 28•3 years ago
(In reply to Greg Mierzwinski [:sparky] from comment #27)
> :markco it seems like there are still some machines that are broken: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=b266b48ee2bf8da618838b7c492c526fcac131c7&selectedTaskRun=A5Ck1sdQSAaKuNRceCN2bw.0
This had only been applied to a limited pool with worker type gecko-t-win10-64-dev. I want to make sure updating the package resolved the issue before pushing this out to the larger production pool with worker type gecko-t-win10-64-1803-hw.
Reporter
Comment 29•3 years ago
Ah sorry! I missed that. I'll make a new try run using those workers shortly.
Reporter
Comment 30•3 years ago
:markco, the tests on the dev pool look good to me (the windows 7 failures are unrelated): https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=f2e6d0e565792dea30a275427bedf9ecbe76fce1
Comment 31•3 years ago
I am hoping that this will trigger the nodes to download and reinstall Carbon: https://github.com/mozilla-platform-ops/ronin_puppet/pull/223/files. The current nodes will all pick up these changes within the next hour or two. There is a bit of a challenge with masterless Puppet and updating the zip packages. If this fails, the update will be picked up as I redeploy the datacenter workers this week and next.
Reporter
Comment 32•3 years ago
I haven't seen this error in a while now, thanks for fixing this :markco!