Closed Bug 732025 Opened 8 years ago Closed 8 years ago

post_upload.py should use hardlinks where possible

Categories

(Release Engineering :: General, enhancement, P3)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

References

Details

(Whiteboard: [ftp][postupload])

Attachments

(2 files, 1 obsolete file)

In many cases, post_upload.py is copying the same file to multiple locations on disk. It would be more efficient to use hard links if we're copying to the same partition.
Component: Release Engineering → Release Engineering: Automation
QA Contact: release → catlee
Severity: normal → enhancement
Priority: -- → P3
Attached patch try to hardlink first (obsolete) — Splinter Review
this maintains a cache per run of post_upload.py of which files have been copied in, and tries to use hardlinks when the same file is being copied multiple times.

os.link raises an OSError in the case where new_file already exists, or if new_file and _linkCache[original_file] are on different partitions. In either case we fall back to the old style behaviour.

I've switched the old style behaviour to using temporary files in the destination directory instead of temporary sub directories, because I think this will result in less churn on the directory itself.
Attachment #603451 - Flags: feedback?
Attachment #603451 - Flags: feedback? → feedback?(nrthomas)
Comment on attachment 603451 [details] [diff] [review]
try to hardlink first

>@@ -60,22 +63,32 @@ def CopyFileToDir(original_file, source_dir, dest_dir, preserve_dirs=False):
>-            return 
>-    tmpdir = tempfile.mkdtemp(prefix=".~", dir=dest_dir)
>-    tmp_path = os.path.join(tmpdir, os.path.basename(new_file))
>-    shutil.copyfile(original_file, tmp_path)
[snip]
>+    tmp_fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
>+    tmp_fp = os.fdopen(tmp_fd, 'wb')
>+    shutil.copyfileobj(open(original_file, 'rb'), tmp_fp)
>+    tmp_fp.close()
>     os.rename(tmp_path, new_file)
>-    os.rmdir(tmpdir)

So this change would avoid the temporary directory and just have a temp file. Are there any other benefits ? What is fp short for ?

>+    _linkCache[original_file] = new_file

original_file is an absolute path, so if we call CopyFileToDir several times on how do we get any hits on the cache ? Probably would work if you basename first.
Comment on attachment 603451 [details] [diff] [review]
try to hardlink first

feedback+ on the idea, but see the questions on the implementation in the previous comment. 

Looks like this will mostly help with the duplication of nightlies in the dated, latest, and tinderbox-builds. I did a quick test and moving them out of tinderbox-builds/latest and the dated file is fine, as you'd expect.
Attachment #603451 - Flags: feedback?(nrthomas) → feedback+
(In reply to Nick Thomas [:nthomas] from comment #2)
> Comment on attachment 603451 [details] [diff] [review]
> try to hardlink first
> 
> >@@ -60,22 +63,32 @@ def CopyFileToDir(original_file, source_dir, dest_dir, preserve_dirs=False):
> >-            return 
> >-    tmpdir = tempfile.mkdtemp(prefix=".~", dir=dest_dir)
> >-    tmp_path = os.path.join(tmpdir, os.path.basename(new_file))
> >-    shutil.copyfile(original_file, tmp_path)
> [snip]
> >+    tmp_fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
> >+    tmp_fp = os.fdopen(tmp_fd, 'wb')
> >+    shutil.copyfileobj(open(original_file, 'rb'), tmp_fp)
> >+    tmp_fp.close()
> >     os.rename(tmp_path, new_file)
> >-    os.rmdir(tmpdir)
> 
> So this change would avoid the temporary directory and just have a temp
> file. Are there any other benefits ? What is fp short for ?

I believe that not using a temporary directory reduces the number of extra directory entries in the parent. When using temp dirs you have at most one temp dir and one file per file actively being copied in. When using temp files, I think you only have the one temp file. I hope this reduces the churn on the directory file itself.

> >+    _linkCache[original_file] = new_file
> 
> original_file is an absolute path, so if we call CopyFileToDir several times
> on how do we get any hits on the cache ? Probably would work if you basename
> first.

original_file would be a path in /tmp, wherever 'make upload' put them. it will be the same for multiple calls to CopyFileToDir in a single invocation of post_upload.py
Assignee: nobody → catlee
Ok, sounds good.
Comment on attachment 603451 [details] [diff] [review]
try to hardlink first

staging run and local testing look good.
Attachment #603451 - Flags: review?(nrthomas)
Attachment #603451 - Flags: review?(nrthomas) → review+
Attached patch hardlink harderSplinter Review
similar to before except keep a history of previous locations we've copied to instead of just the most recent one. this protects against worst case scenarios where a file is copied to partitions in order A,B,A.
Attachment #603451 - Attachment is obsolete: true
Attachment #628502 - Flags: review?(nrthomas)
Comment on attachment 628502 [details] [diff] [review]
hardlink harder

Looks fine too.
Attachment #628502 - Flags: review?(nrthomas) → review+
Comment on attachment 628502 [details] [diff] [review]
hardlink harder

checked into tools repo. will do svn deployment tomorrow
Attachment #628502 - Flags: checked-in+
Blocks: 760059
bounced in deployment because permissions don't get set correctly
right now we rely on umask to have files that are readable. I'm not sure why we haven't tried to set permissions explicitly in post_upload.py before, but this patch sets new files to 0644. the previous version failed because it uses mkstemp() to create the new files, which have permissions 0600.
Attachment #628774 - Flags: review?(nrthomas)
Attachment #628774 - Flags: review?(bhearsum)
Comment on attachment 628774 [details] [diff] [review]
set permissions explicitly

Review of attachment 628774 [details] [diff] [review]:
-----------------------------------------------------------------

I briefly looked at some existing files and couldn't find anything that wasn't 644, so I think this is safe to do...
Attachment #628774 - Flags: review?(bhearsum) → review+
Comment on attachment 628774 [details] [diff] [review]
set permissions explicitly

We could go a step further and ensure the group ownership is set to the product, to ensure the primary group is set properly, but with files getting 644 and dirs getting 755 that's not really necessary.
Attachment #628774 - Flags: review?(nrthomas) → review+
Comment on attachment 628774 [details] [diff] [review]
set permissions explicitly

will wait for current releases to finish before deploying with svn
Attachment #628774 - Flags: checked-in+
deployed via svn/puppet
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
FYI, recent changes in how the directories are broken up into several partitions negate the hardlinking added here. eg firefox/tinderbox-builds and firefox/nightly are now different partitions, where previously they were on a single partition.
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.