Currently we manually set up working directories that pull directly from hg.m.o (and other servers), do any needed conversion, then push to one or more target servers.

We intermittently get bad pulls from hg.m.o. I'm not sure if it's server-specific or what; aiui our current theory is that it's a replication-to-webheads issue for the hg lockfiles. We're guessing that changing from http(s):// to ssh:// would remove or reduce the occurrence of these, but we're not sure.

Corrupting our conversion workspace can mean we have to restart that process from zero, which in the case of non-cvs-based m-c (with nothing else added) took on the order of 5 hours to fix. In the case of needing to prepend cvs history and pull in multiple other repos, etc., the fix would take much, much longer. I think we should prevent repo corruption in the first place.

One way to avoid corrupting our conversion workspace is to do an intermediate clone/pull elsewhere, e.g.

1) clone/pull in /builds/hg-shared/mozilla-central
2) hg verify in /builds/hg-shared/mozilla-central
3) if verify fails (and/or we detect corruption through error messages, etc.), blow away /builds/hg-shared/mozilla-central and reclone it. Repeat until we pass verification or hit some max retries limit (and fail out noisily).
4) once we pass verify, clone/pull from /builds/hg-shared/mozilla-central to our working directory.

Steps 1-3 (or at least (1) and (3)) are provided by hgtool -s, and possibly gittool -s. However, there's a hardcoded assumption that we're going to then share from that directory to the working directory. I think we should clone rather than share, to avoid shared .hg or .git corruption. We can either have some duplicate code, or add more functionality to those tools.

Also, we will be on a machine running many of these processes; we may have multiple mozilla-centrals that need updating. We may not be able to use the same /builds/hg-shared/mozilla-central for three different process loops unless we make sure that they don't stomp all over each other. This may mean multiple discrete share dirs.
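
For illustration, steps 1-4 could be sketched roughly as below. This is a minimal sketch, not the hgtool or mozharness implementation: the function names, the max_retries default, and the injectable run callable are all hypothetical. It only detects corruption via hg verify (not via error messages), and it takes the intermediate directory as a parameter so each process loop can be pointed at its own share dir.

```python
import shutil
import subprocess
from pathlib import Path


def run_hg(args):
    """Run an hg command; return True if it exited with status 0."""
    return subprocess.run(["hg", *args]).returncode == 0


def update_verified_clone(source, intermediate, workdir, max_retries=3, run=run_hg):
    """Hypothetical helper: refresh `intermediate` from `source`, verify it,
    and only then clone/pull from it into `workdir`."""
    intermediate, workdir = Path(intermediate), Path(workdir)
    for _attempt in range(max_retries):
        # Step 1: pull if the intermediate repo already exists, else clone.
        if (intermediate / ".hg").is_dir():
            ok = run(["pull", "-R", str(intermediate), source])
        else:
            intermediate.parent.mkdir(parents=True, exist_ok=True)
            ok = run(["clone", "--noupdate", source, str(intermediate)])
        # Step 2: only trust the intermediate repo once hg verify passes.
        if ok and run(["verify", "-R", str(intermediate)]):
            break
        # Step 3: blow away the suspect repo so the next attempt reclones.
        shutil.rmtree(intermediate, ignore_errors=True)
    else:
        # Ran out of retries without a verified repo: fail out noisily.
        raise RuntimeError(
            "%s failed to clone/verify after %d attempts" % (intermediate, max_retries)
        )
    # Step 4: clone (not share) from the verified repo to the working dir.
    if (workdir / ".hg").is_dir():
        ok = run(["pull", "-R", str(workdir), str(intermediate)])
    else:
        workdir.parent.mkdir(parents=True, exist_ok=True)
        ok = run(["clone", "--noupdate", str(intermediate), str(workdir)])
    if not ok:
        raise RuntimeError("could not update %s from %s" % (workdir, intermediate))
```

Because the working directory only ever pulls from a repo that just passed verify, a bad pull from hg.m.o costs a reclone of the intermediate copy rather than the conversion workspace. Giving each process loop its own intermediate path is the simplest way to keep them from stomping on each other; sharing one path would need some kind of locking, which this sketch does not attempt.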
I already implemented this in the existing beagle branch:
https://github.com/escapewindow/mozharness/blob/beagle/scripts/poc_beagle.py#L138

I'd love to verify this when the stage mirror actually catches a corrupt update.
Calling this fixed.