Open Bug 1408357 Opened 7 years ago Updated 2 years ago

mach vendor rust should detect copies to avoid hg repo bloat

Categories

(Firefox Build System :: General, enhancement)

Tracking

(Not tracked)

People

(Reporter: kats, Unassigned)

Often, when a crate's dependency is updated, we end up with one crate getting moved and a new crate getting added. For example, the third_party/rust/bincode folder (currently holding 0.8.0) might get moved to bincode-0.8.0, with the bincode folder getting updated to 0.9.0. (This is the scenario that will happen with the next webrender update.)

The problem is that hg doesn't detect the files that end up in bincode-0.8.0 as moves, so it stores another copy of them internally. If we are smart about this, we can avoid hg storing the same thing twice. Right now `mach vendor rust` uses `hg addremove` to stage the changes, which detects renames but not copies. If `hg addremove` detected copies, we would get this fixed for free, which would be great. That is an outstanding bug against hg: https://bz.mercurial-scm.org/show_bug.cgi?id=3432. :gps, is there anything you can do to help move that along?

Failing that, another approach would be to stop using `hg addremove` and instead write something smarter that takes advantage of what we know happens during vendoring at the crate level (this crate got copied over there, and the in-place one got updated).
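A rough sketch of that smarter approach, in Python since mach is written in Python: compare content hashes before and after vendoring, and for every newly added file whose bytes match a previously tracked file, record the copy with `hg copy --after`. The names `digest`, `plan_copies` and `hg_commands`, and the path-to-digest dictionaries they take, are hypothetical and only for illustration. This is not what `mach vendor rust` does today.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content hash used to spot byte-identical files."""
    return hashlib.sha256(data).hexdigest()


def plan_copies(before, after):
    """Find files that are new after vendoring but identical to an old file.

    `before` and `after` map repo-relative paths to content digests
    (hypothetical inputs; a real tool would build them by walking
    third_party/rust before and after running the vendoring step).
    Returns a list of (source, destination) pairs.
    """
    # Index the old tree by content; on duplicates keep the first path in
    # sorted order so the result is deterministic.
    by_digest = {}
    for path in sorted(before):
        by_digest.setdefault(before[path], path)

    copies = []
    for path in sorted(after):
        if path in before:
            continue  # already tracked; an in-place update, not a copy
        source = by_digest.get(after[path])
        if source is not None:
            copies.append((source, path))
    return copies


def hg_commands(copies):
    """Build one `hg copy --after` invocation per detected copy.

    `--after` records a copy that has already happened in the working
    directory instead of performing it.
    """
    return [["hg", "copy", "--after", src, dst] for src, dst in copies]
```

For the bincode scenario, `before` would hold the 0.8.0 files under third_party/rust/bincode, and `after` would hold the updated 0.9.0 files at the same paths plus the untouched 0.8.0 files under third_party/rust/bincode-0.8.0. Only the latter would be reported as copies. Matching on exact content is the simplest choice; using the crate-level knowledge of which directory was copied where would avoid hashing entirely.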

For context, I thought of this because the bincode repo contains a useless logo.png which is larger than the 100k file size limit that `mach vendor rust` now imposes. I'll deal with that separately though.
Flags: needinfo?(gps)
There's a `--similarity` argument to addremove that takes a percentage to use to detect renames. We could add that and tweak it until we get reasonable results.
`--similarity` only picks up moves, not copies. That is, the original file would have to be removed in order for it to work, which is usually not the case for us. Tweaking the similarity down from the default of 100 might help, but I expect it will have minimal impact given the types of changes that vendoring produces in practice.
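To illustrate why the threshold doesn't help here, this is a deliberately simplified toy model of rename detection in Python. It is not Mercurial's implementation: the function `find_renames` is hypothetical, and `difflib.SequenceMatcher` merely stands in for whatever similarity score hg computes. The point it shows is that added files are only ever compared against removed files.

```python
from difflib import SequenceMatcher


def find_renames(removed, added, threshold=1.0):
    """Toy rename detection: match each added file to its best removed file.

    `removed` and `added` map paths to file contents (hypothetical inputs).
    `threshold` is a fraction, so 1.0 plays the role of the default 100%.
    Returns a list of (old_path, new_path, score) tuples.
    """
    matches = []
    for new_path, new_text in sorted(added.items()):
        best = None
        # Only *removed* files are candidate sources. A file that still
        # exists (as in the vendoring case) is never considered.
        for old_path, old_text in sorted(removed.items()):
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (old_path, new_path, score)
        if best is not None:
            matches.append(best)
    return matches
```

In the vendoring case the bincode folder is updated in place rather than removed, so `removed` is empty and the function returns nothing no matter how low `threshold` is set.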
Mercurial's default storage layer doesn't do content-level de-duping (unlike Git, which does). So copy detection would be mostly about preserving metadata as opposed to storage savings.

I suspect we could get some traction to fix this upstream, especially since people have recently touched the copy tracing code for the 4.4 release (https://www.mercurial-scm.org/repo/hg/rev/036d47d7cf39). I don't have a good grasp of this area of the code, otherwise I would consider taking a stab at it. I'll comment on the upstream issue...
Flags: needinfo?(gps)
Product: Core → Firefox Build System
Severity: normal → S3