Closed Bug 1616845 Opened 3 years ago Closed 3 years ago

Near-perma "OSError: symbolic link privilege not held" in fetch-content

Categories

(Infrastructure & Operations :: RelOps: Hardware, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sparky, Unassigned)

References

(Blocks 2 open bugs)

Details

This is for an intermittent bug that has been popping up in Browsertime tests (I can't find it anywhere else). Here's the error log:

[fetches 2020-02-19T00:16:28.468Z]   inflating: ffmpeg-4.1.1-win64-static/README.txt  
[fetches 2020-02-19T00:16:28.468Z]   inflating: ffmpeg-4.1.1-win64-static/LICENSE.txt  
[fetches 2020-02-19T00:16:28.469Z]    creating: ffmpeg-4.1.1-win64-static/presets/
[fetches 2020-02-19T00:16:28.469Z]   inflating: ffmpeg-4.1.1-win64-static/presets/libvpx-1080p50_60.ffpreset  
[fetches 2020-02-19T00:16:28.470Z]   inflating: ffmpeg-4.1.1-win64-static/presets/ffprobe.xsd  
[fetches 2020-02-19T00:16:28.471Z]   inflating: ffmpeg-4.1.1-win64-static/presets/libvpx-720p.ffpreset  
[fetches 2020-02-19T00:16:28.471Z]   inflating: ffmpeg-4.1.1-win64-static/presets/libvpx-360p.ffpreset  
[fetches 2020-02-19T00:16:28.472Z]   inflating: ffmpeg-4.1.1-win64-static/presets/libvpx-1080p.ffpreset  
[fetches 2020-02-19T00:16:28.472Z]   inflating: ffmpeg-4.1.1-win64-static/presets/libvpx-720p50_60.ffpreset  
[fetches 2020-02-19T00:16:28.474Z] C:\Users\task_1582070133\fetches\ffmpeg-4.1.1-win64-static.zip extracted in 1.605s
[fetches 2020-02-19T00:16:28.474Z] Removing C:\Users\task_1582070133\fetches\ffmpeg-4.1.1-win64-static.zip
[fetches 2020-02-19T00:16:31.007Z] C:\Users\task_1582070133\fetches\node.tar.bz2 extracted in 5.397s
[fetches 2020-02-19T00:16:31.009Z] Removing C:\Users\task_1582070133\fetches\node.tar.bz2
[fetches 2020-02-19T00:16:31.011Z] Traceback (most recent call last):
[fetches 2020-02-19T00:16:31.011Z]   File "C:\mozilla-build\python3\lib\tarfile.py", line 2210, in makelink
[fetches 2020-02-19T00:16:31.012Z]     os.symlink(tarinfo.linkname, targetpath)
[fetches 2020-02-19T00:16:31.012Z] OSError: symbolic link privilege not held
[fetches 2020-02-19T00:16:31.012Z] 
[fetches 2020-02-19T00:16:31.012Z] During handling of the above exception, another exception occurred:
[fetches 2020-02-19T00:16:31.012Z] 
[fetches 2020-02-19T00:16:31.012Z] Traceback (most recent call last):
[fetches 2020-02-19T00:16:31.012Z]   File "fetch-content", line 663, in <module>
[fetches 2020-02-19T00:16:31.013Z]     sys.exit(main())
[fetches 2020-02-19T00:16:31.013Z]   File "fetch-content", line 659, in main
[fetches 2020-02-19T00:16:31.013Z]     return args.func(args)
[fetches 2020-02-19T00:16:31.013Z]   File "fetch-content", line 606, in command_task_artifacts
[fetches 2020-02-19T00:16:31.013Z]     fetch_urls(downloads)
[fetches 2020-02-19T00:16:31.013Z]   File "fetch-content", line 482, in fetch_urls
[fetches 2020-02-19T00:16:31.013Z]     f.result()
[fetches 2020-02-19T00:16:31.013Z]   File "C:\mozilla-build\python3\lib\concurrent\futures\_base.py", line 432, in result
[fetches 2020-02-19T00:16:31.014Z]     return self.__get_result()
[fetches 2020-02-19T00:16:31.014Z]   File "C:\mozilla-build\python3\lib\concurrent\futures\_base.py", line 384, in __get_result
[fetches 2020-02-19T00:16:31.014Z]     raise self._exception
[fetches 2020-02-19T00:16:31.014Z]   File "C:\mozilla-build\python3\lib\concurrent\futures\thread.py", line 56, in run
[fetches 2020-02-19T00:16:31.015Z]     result = self.fn(*self.args, **self.kwargs)
[fetches 2020-02-19T00:16:31.015Z]   File "fetch-content", line 468, in fetch_and_extract
[fetches 2020-02-19T00:16:31.015Z]     extract_archive(dest_path, dest_dir, typ)
[fetches 2020-02-19T00:16:31.015Z]   File "fetch-content", line 329, in extract_archive
[fetches 2020-02-19T00:16:31.016Z]     tar.extractall(str(dest_dir))
[fetches 2020-02-19T00:16:31.016Z]   File "C:\mozilla-build\python3\lib\tarfile.py", line 2008, in extractall
[fetches 2020-02-19T00:16:31.016Z]     numeric_owner=numeric_owner)
[fetches 2020-02-19T00:16:31.016Z]   File "C:\mozilla-build\python3\lib\tarfile.py", line 2050, in extract
[fetches 2020-02-19T00:16:31.018Z]     numeric_owner=numeric_owner)
[fetches 2020-02-19T00:16:31.018Z]   File "C:\mozilla-build\python3\lib\tarfile.py", line 2128, in _extract_member
[fetches 2020-02-19T00:16:31.018Z]     self.makelink(tarinfo, targetpath)
[fetches 2020-02-19T00:16:31.019Z]   File "C:\mozilla-build\python3\lib\tarfile.py", line 2221, in makelink
[fetches 2020-02-19T00:16:31.019Z]     targetpath)
[fetches 2020-02-19T00:16:31.020Z]   File "C:\mozilla-build\python3\lib\tarfile.py", line 2120, in _extract_member
[fetches 2020-02-19T00:16:31.020Z]     self.makefile(tarinfo, targetpath)
[fetches 2020-02-19T00:16:31.020Z]   File "C:\mozilla-build\python3\lib\tarfile.py", line 2159, in makefile
[fetches 2020-02-19T00:16:31.021Z]     source.seek(tarinfo.offset_data)
[fetches 2020-02-19T00:16:31.021Z]   File "C:\mozilla-build\python3\lib\tarfile.py", line 522, in seek
[fetches 2020-02-19T00:16:31.021Z]     raise StreamError("seeking backwards is not allowed")
[fetches 2020-02-19T00:16:31.021Z] tarfile.StreamError: seeking backwards is not allowed
[fetches 2020-02-19T00:16:31.045Z] removing C:/Users/task_1582070133/fetches
[fetches 2020-02-19T00:16:32.221Z] finished
[taskcluster 2020-02-19T00:16:32.236Z]    Exit Code: 1

The failure can be found here: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&group_state=expanded&tier=1%2C2%2C3&searchStr=browsertime&revision=df596657bebcb96b917d75ff452316bbe8140a1a&selectedJob=289436219

The actually relevant part of the log is OSError: symbolic link privilege not held. If it's intermittent, it would tend to mean workers are not consistently configured.

Component: Task Configuration → Workers
Product: Firefox Build System → Taskcluster
Component: Workers → RelOps: Hardware
Product: Taskcluster → Infrastructure & Operations

I've updated the title to use the relevant part of the log.

Summary: Intermittent "tarfile.StreamError: seeking backwards is not allowed" in fetch-content → Intermittent "OSError: symbolic link privilege not held" in fetch-content

I had one occurence with this patch:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=0c5165dcfacabdf5aa12af1c3805f498279a4ba2&selectedTaskRun=SY196oCaRze-8ZDu2Y4kMQ.0

it's happening on source checkout there.

Dave, this is with the perftest patch that now works on talos-linux python38, that I am trying to have working on windows now

Flags: needinfo?(dhouse)

(In reply to Tarek Ziadé (:tarek) from comment #5)

I had one occurence with this patch:

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=0c5165dcfacabdf5aa12af1c3805f498279a4ba2&selectedTaskRun=SY196oCaRze-8ZDu2Y4kMQ.0

it's happening on source checkout there.

Dave, this is with the perftest patch that now works on talos-linux python38, that I am trying to have working on windows now

Is this a new task? I'm guessing this is the new task we are changing the linux64 configuration to run (bug 1647648). If so, then we'll need one of the Windows engineers and/or to get a loaner set up for dev and figuring out config changes.

redirecting to :markco for debugging the windows-hw config for the new task(s).

Flags: needinfo?(dhouse) → needinfo?(mcornmesser)

"OSError: symbolic link privilege not held" leads me t believe there is a race condition in which is affecting the privileges held by the task user. I will spin up a test pool in the later part of this week and see if I can recreate the error.

Flags: needinfo?(mcornmesser)

(In reply to Dave House [:dhouse] from comment #6)

Is this a new task? I'm guessing this is the new task we are changing the linux64 configuration to run (bug 1647648). If so, then we'll need one of the Windows engineers and/or to get a loaner set up for dev and figuring out config changes.

Yeah

redirecting to :markco for debugging the windows-hw config for the new task(s).

thanks,.

Blocks: 1644778
Summary: Intermittent "OSError: symbolic link privilege not held" in fetch-content → Near-perma "OSError: symbolic link privilege not held" in fetch-content

I am wondering if it's not a TC race issue, I am seeing some errors on macOS generic workers as well, which are also happening while grabbing/decompressing the node tarball

https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=2648020922beeba2a4f41bc654fabfcae28c5c3c&selectedTaskRun=afVFru2oSa-r5zBxYUDwRQ.0

Blocks: 1651043

I think the problem comes from this thread pool executor: https://searchfox.org/mozilla-central/source/taskcluster/scripts/misc/fetch-content#475

on macOs at least, moving back to a synchronous loop fixed my bug. trying to run the win version to see if it fixes it

notice that all logs are mixed up because of that threadpool executor. which makes it hard to debug and understand when it breaks.

Adding a couple of people to get their opinions

Flags: needinfo?(mozilla)
Flags: needinfo?(ahal)

I'm not familiar enough with fetch-content to say whether making it synchronous is acceptable or not. I assume the answer would be "no" and we'd want an alternate fix, but :tomprince or :glandium can probably answer better.

Flags: needinfo?(ahal)

ok thanks. I am trying to change the pool to processes instead of threads to see if it helps.

Flags: needinfo?(mh+mozilla)

This is my workaround:

 
 def fetch_urls(downloads):
     """Fetch URLs pairs to a pathlib.Path."""
+    if 'MOZ_SYNC_FETCHES' in os.environ:
+        for download in downloads:
+            fetch_and_extract(*download)
+        return
     with concurrent.futures.ThreadPoolExecutor(CONCURRENCY) as e:
         fs = []
-
         for download in downloads:
             fs.append(e.submit(fetch_and_extract, *download))
-
         for f in fs:
             f.result()

I guess we could use MOZ_SYNC_FETCHES with caution on talos only?

(In reply to Tarek Ziadé (:tarek) from comment #17)

nope https://firefoxci.taskcluster-artifacts.net/CvmCQW6gTM-zdrY9nMsfhw/0/public/logs/live_backing.log

This is not at all the same error as comment 0.

Flags: needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #19)

(In reply to Tarek Ziadé (:tarek) from comment #17)

nope https://firefoxci.taskcluster-artifacts.net/CvmCQW6gTM-zdrY9nMsfhw/0/public/logs/live_backing.log

This is not at all the same error as comment 0.

we suspect it is the same nature (race conditions)

It seems like I am hitting this error while extracing the browsertime.tar.gz archive. Synchronous downloads/extracts as suggested by comment #18 don't solve this but it does make it easier to debug.

One thing I'm wondering if it's possible that the Windows permissions required to create symlinks is intermittent on our windows machines. :dhouse, would you have some way to check this?

Flags: needinfo?(mozilla) → needinfo?(dhouse)
Flags: needinfo?(rthijssen)

+1 for redirect of NI to :grenade. I'm not sure how to check for issues with symlinks on Windows.

Flags: needinfo?(dhouse)

(In reply to Greg Mierzwinski [:sparky] from comment #21)

One thing I'm wondering if it's possible that the Windows permissions required to create symlinks is intermittent on our windows machines. :dhouse, would you have some way to check this?

yes. i'd say that's quite possible, probably likely. the fix (for cloud instances) won't be quick as we are caught between image deployment mechanisms just now (another cloud migration). the symlink permission granting mechanism relies on an upstream powershell module (Carbon) which has moved from it's currently configured source, to github, which is the likeliest reason for this bustage. images which have carbon baked in (everything in ec2), are probably fine but hardware that has been recently reimaged will definitely have issues.

:markco: dustin mentions gecko-t-win10-64-hw in comment 2, do you know if this worker type relies on bitbucket carbon for symlinks?

Flags: needinfo?(rthijssen) → needinfo?(mcornmesser)

:markco: dustin mentions gecko-t-win10-64-hw in comment 2, do you know if this worker type relies on bitbucket carbon for symlinks?

It looks there is a possible issue with how Puppet is calling the bitbucket carbon commands. I will take a look at it this morning.

Flags: needinfo?(mcornmesser)

I had to update the carbon package. It was missing a few files. Within the next 30 minutes the gecko-t-win10-64-dev workers will have the updated packages. Could someone test on the those workers to see if this resolves the problem?

Sorry that this fail off my radar. After comment 14 I thought this was no longer an issue.

Thanks :markco! I'll start some browsertime tests shortly to see if it's still an issue.

Flags: needinfo?(gmierz2)
Flags: needinfo?(gmierz2) → needinfo?(mcornmesser)

(In reply to Greg Mierzwinski [:sparky] from comment #27)

:markco it seems like there are still some machines that are broken: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=b266b48ee2bf8da618838b7c492c526fcac131c7&selectedTaskRun=A5Ck1sdQSAaKuNRceCN2bw.0

This had only been applied to a limited pool with worker type gecko-t-win10-64-dev. I want to make sure updating the packaged resolved the issue before pushing this out to the larger production pool with worker type gecko-t-win10-64-1803-hw.

Flags: needinfo?(mcornmesser)

Ah sorry! I missed that. I'll make a new try run using those workers shortly.

:markco, the tests on the dev pool look good to me (the windows 7 failures are unrelated): https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=f2e6d0e565792dea30a275427bedf9ecbe76fce1

Flags: needinfo?(mcornmesser)

I am hoping that this will trigger the nodes to download and reinstall carbon https://github.com/mozilla-platform-ops/ronin_puppet/pull/223/files. The current nodes will all pick up these changes with in the next hour to 2. There is a bit of challenge with masterless puppet and updating the zip packages. If this fails the update will be picked up as I redeploy the datacenter workers this week and next.

Flags: needinfo?(mcornmesser)
Blocks: 1667396
Blocks: 1667398

I haven't seen this error in a while now, thanks for fixing this :markco!

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.