Closed Bug 1747280 Opened 2 years ago Closed 2 years ago

"comm" clone remains in Firefox checkout, even in non-TB CI jobs.

Categories

(Firefox Build System :: Task Configuration, defect)

Tracking

(firefox97 fixed)

RESOLVED FIXED
97 Branch
Tracking Status
firefox97 --- fixed

People

(Reporter: mhentges, Assigned: mozilla)

References

(Blocks 1 open bug)

Attachments

(1 file)

This can lead to intermittents due to in-tree code assuming it's running from a specific revision without any code changes. Since this "mutated checkout" state isn't consistent, tracking down associated failures will be tough.

This bug was caused by this, because a previous Thunderbird task left comm/ sitting in the checkout.

For docker-worker, at least, it looks like we:

  • grab a cache that persists across tasks, e.g. gecko-level-3-checkouts-hg58-v3-35e6d2147228a7dd8319, and mount that at /builds/worker/checkouts
  • set the hg-store path to /builds/worker/checkouts/hg-store, which makes sense, because we want the hg store to persist across tasks
  • set the gecko path to /builds/worker/checkouts/gecko, which also persists across tasks

Then we hg robustcheckout with --purge, which in theory should clean up all unknown files lying around inside the clone and give us a clean checkout. This runs hg purge.

However, hg purge does not remove a comm checkout inside of your gecko checkout, presumably because we're only purging our clone, not nested clones. To verify, I cloned comm-central inside mozilla-unified, then ran hg purge -p --all. This printed all unknown files, including my objdir, but did not mention comm/.
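The same check can be approximated without Mercurial. This is a minimal sketch, assuming a nested clone is any directory below the checkout root that contains its own .hg directory. The helper name find_nested_clones is hypothetical and is not part of robustcheckout or run-task.

```python
import os


def find_nested_clones(checkout_root):
    """Return directories below checkout_root that contain their own .hg."""
    nested = []
    for dirpath, dirnames, _filenames in os.walk(checkout_root):
        if dirpath == checkout_root:
            # The checkout's own .hg is not a nested clone; skip walking it.
            if ".hg" in dirnames:
                dirnames.remove(".hg")
            continue
        if ".hg" in dirnames:
            nested.append(dirpath)
            # Do not descend into the nested clone itself.
            dirnames[:] = []
    return sorted(nested)
```

Run against a gecko checkout that still contains comm/, this should list the comm directory, which hg purge -p --all did not mention.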

Therefore, I'm inclined to rename this bug to one of the following:

  • "robustcheckout --purge doesn't handle nested clones", or
  • "thunderbird builds should use a separate checkout path, e.g. /builds/worker/checkouts/tb-gecko/ [until they run from a single branch]" (this may be disk inefficient, but may avoid this bug), or
  • "thunderbird builds should use separate worker pools". This last will be very inefficient, especially with our limited PGO hardware builder pools.

I'm leaning towards someone adding a nested clone cleanup option to robustcheckout. What do you think? Keeping this under the --purge option will be easier, since we don't have to uplift everywhere, but --purge that does more than hg purge may be a confusing misnomer. But having a different option may require us to uplift everywhere to roll out.

Flags: needinfo?(mhentges)

(Maybe hg purge should be able to handle nested clones, given a --include-nested-clones arg or similar.)

(Or maybe whatever directory traversal we're using for our python stuff should be clone-boundary aware?)

> Therefore, I'm inclined to rename this bug to one of the following:

Hmm, good point, this bug name is a little bit too broad. I'll modify it :)

> I'm leaning towards someone adding a nested clone cleanup option to robustcheckout. What do you think? Keeping this under the --purge option will be easier, since we don't have to uplift everywhere, but --purge that does more than hg purge may be a confusing misnomer. But having a different option may require us to uplift everywhere to roll out.

That makes sense - it looks like hg handles nested repos in weird ways.
For example, even if comm is removed from .hgignore, it still doesn't show up in hg status or even hg status -A.

Another option is that, in the same code that runs hg purge, we can optionally remove <checkout>/comm if the current task doesn't need Thunderbird? Unsure how viable that is :)

Summary: Tasks aren't always run from clean checkouts → "comm" clone remains in Firefox checkout, even in non-TB CI jobs.

I'd rather we be generic since the issue will appear for any nested clone, but yeah, we could hardcode nuking comm/.

Moving components. If we hack the comm removal, that's probably in https://searchfox.org/mozilla-central/source/taskcluster/scripts/run-task, probably after we run robustcheckout --purge on line 471.

Otherwise we may want to hack https://hg.mozilla.org/hgcustom/version-control-tools/file/tip/hgext/robustcheckout to update robustcheckout --purge to purge across clone boundaries, or we may want to hack https://www.mercurial-scm.org/repo/hg/file/tip/hgext/purge.py to purge across clone boundaries. (Edit: this last may be more dirstate; purge is now hg core)

Component: General → Task Configuration
Product: Release Engineering → Firefox Build System
QA Contact: aki

Thanks Aki, I appreciate it. I'll float this by :sheehan as well next time we have a chat :)

Flags: needinfo?(mhentges)

I suspect https://www.mercurial-scm.org/repo/hg/file/tip/rust/hg-core/src/dirstate_tree/status.rs for the upstream, but I'm not familiar enough with the mercurial project to know if they've fully cut over to rust dirstate. (Edit: yeah, I think dirstate_node means it has an .hg/dirstate, and we compare these against our current path, and so we ignore nested clones. Not sure if we can convince upstream that we want to be able to nuke, or at least detect, nested clones)

Oh, https://www.mercurial-scm.org/repo/hg/annotate/tip/rust/hg-core/src/dirstate_tree/status.rs#l703

# DirEntry
    /// If a `.hg` sub-directory is encountered:
    ///
    /// * At the repository root, ignore that sub-directory
    /// * Elsewhere, we’re listing the content of a sub-repo. Return an empty
    ///   list instead.
<snip>
            if name == b".hg" {
                if is_at_repo_root {
                    // Skip the repo’s own .hg (might be a symlink)
                    continue;
                } else if metadata.is_dir() {
                    // A .hg sub-directory at another location means a subrepo,
                    // skip it entirely.
                    return Ok(Vec::new());
                }
            }
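To make the effect of that check concrete, here is a rough Python model of the quoted logic. It is only an illustration of the behaviour described in the snippet, not a port of the Mercurial code, and the function name list_entries is made up for this example.

```python
import os


def list_entries(directory, is_at_repo_root):
    """Return entry names in directory, or [] if it is a nested repository."""
    names = []
    with os.scandir(directory) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.name == ".hg":
                if is_at_repo_root:
                    # Skip the repository's own .hg.
                    continue
                if entry.is_dir(follow_symlinks=False):
                    # A .hg directory anywhere else marks a nested repository:
                    # report nothing at all for this directory.
                    return []
            names.append(entry.name)
    return names
```

Under this model, listing comm/ inside a gecko checkout yields an empty list, so none of its files can show up as unknown and nothing under it is ever a candidate for purging.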

An easy way out would be for comm-central to use different caches. As a matter of fact, it is concerning that comm-central ends up using caches from m-c!! (and vice versa)

Hm, https://hg.mozilla.org/comm-central/file/tip/taskcluster/ci/config.yml#l5 says trust-domain: comm, and https://searchfox.org/mozilla-central/source/taskcluster/gecko_taskgraph/transforms/task.py#2086 seems to tell me their caches should be named e.g. comm-level-3-*, not sure what's going on.

And indeed, https://firefox-ci-tc.services.mozilla.com/tasks/aUU8VrF4S7KkN8qMOrGR3A/definition says

    "cache": {
      "comm-level-3-checkouts-hg58-v3-35e6d2147228a7dd8319": "/builds/worker/checkouts"
    },
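A quick way to rule out cross-project cache sharing for a given task is to compare its cache names against the expected trust-domain prefix. This is a rough sketch, assuming cache names follow the <trust-domain>-level-<N>-... pattern seen in this bug. The function name foreign_caches is hypothetical, and the mapping passed in is just the "cache" object from a task definition.

```python
def foreign_caches(cache_mounts, trust_domain):
    """Return cache names that do not start with '<trust_domain>-level-'."""
    prefix = f"{trust_domain}-level-"
    return sorted(name for name in cache_mounts if not name.startswith(prefix))


# The cache mapping quoted from the comm task definition above.
comm_task_caches = {
    "comm-level-3-checkouts-hg58-v3-35e6d2147228a7dd8319": "/builds/worker/checkouts",
}
suspicious = foreign_caches(comm_task_caches, "comm")
```

An empty result for the comm task is consistent with the observation that it uses a comm-prefixed cache rather than a gecko one.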

Ah, mystery solved.
https://firefox-ci-tc.services.mozilla.com/provisioners/gecko-3/worker-types/b-linux/workers/us-west-2/i-0f07dbdd34c9d0b11 is the busted worker. Before running the busted task it ran https://firefox-ci-tc.services.mozilla.com/tasks/Z_7cNdyASPi5LMkEnfcEbw/runs/0 .

Cross-channel clones all of the shipping repos and then builds a set of all en-US strings for localizers, so we cloned comm-central as part of the task.

Solutions:

  • nuke comm/ at the end of cross-channel (finally block after https://searchfox.org/mozilla-central/rev/361f258f46af4b9c881be81d1291000827c15704/tools/compare-locales/mach_commands.py#182 ?)
  • I think :rjl was looking at moving cross-channel for thunderbird into a separate task / repo so :flod doesn't have to deal with Thunderbird strings; if we ran these tasks on comm workers, we wouldn't have to worry about comm clones. (This solution may take a while to implement, though.)
  • continue with robustcheckout/hg purge nested-clone support, since the above two address this one case, but this solution will solve future cases as well
Assignee: nobody → aki
Status: NEW → ASSIGNED

^ should address the first point. I think the second is in progress, and we may want to file a followup for the third.
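For readers without access to the attachment, the first option has roughly the following shape. This is only a sketch of the try/finally idea, not the landed patch. The wrapper name run_with_comm_cleanup and its action callback are hypothetical.

```python
import shutil
from pathlib import Path


def run_with_comm_cleanup(checkout, action):
    """Run action(), then delete <checkout>/comm whether or not it raised."""
    comm = Path(checkout) / "comm"
    try:
        return action()
    finally:
        # ignore_errors=True also covers the case where comm/ was never cloned.
        shutil.rmtree(comm, ignore_errors=True)
```

The cleanup runs on both success and failure, so a later task that reuses the same cached checkout should not find a leftover comm/ clone.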

See Also: → 1742711
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/68098f573c46
nuke comm/ after cross-channel. r=releng-reviewers,jmaher DONTBUILD
Blocks: 1747460
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 97 Branch

Backed out for causing Bug 1747545.

Push with failures

Failure log

Backout link

[task 2021-12-24T11:08:06.995Z] Processing mozilla-unified in /builds/worker/checkouts/gecko
[task 2021-12-24T11:08:06.995Z] Gathering files for central
[task 2021-12-24T11:08:06.995Z] Gathering files for beta
[task 2021-12-24T11:08:06.995Z] Gathering files for release
[task 2021-12-24T11:08:06.995Z] Gathering files for esr91
[task 2021-12-24T11:08:06.995Z] Writing mozilla-unified content to target
[task 2021-12-24T11:08:06.995Z] Error running mach:
[task 2021-12-24T11:08:06.995Z] 
[task 2021-12-24T11:08:06.995Z]     ['l10n-cross-channel', '-o', '/builds/worker/artifacts/outgoing.diff', '--attempts', '5', '--ssh-secret', 'project/releng/gecko/build/level-3/l10n-cross-channel-quarantine-ssh', 'prep', 'create', 'push']
[task 2021-12-24T11:08:06.995Z] 
[task 2021-12-24T11:08:06.995Z] The error occurred in code that was called by the mach command. This is either
[task 2021-12-24T11:08:06.995Z] a bug in the called code itself or in the way that mach is calling it.
[task 2021-12-24T11:08:06.995Z] You can invoke |./mach busted| to check if this issue is already on file. If it
[task 2021-12-24T11:08:06.995Z] isn't, please use |./mach busted file l10n-cross-channel| to report it. If |./mach busted| is
[task 2021-12-24T11:08:06.995Z] misbehaving, you can also inspect the dependencies of bug 1543241.
[task 2021-12-24T11:08:06.995Z] 
[task 2021-12-24T11:08:06.995Z] If filing a bug, please include the full output of mach, including this error
[task 2021-12-24T11:08:06.995Z] message.
[task 2021-12-24T11:08:06.995Z] 
[task 2021-12-24T11:08:06.995Z] The details of the failure are as follows:
[task 2021-12-24T11:08:06.995Z] 
[task 2021-12-24T11:08:06.995Z] hglib.error.ServerError: server exited with status 255: b'abort: repository /builds/worker/checkouts/gecko/comm not found'
[task 2021-12-24T11:08:06.995Z] 
[task 2021-12-24T11:08:06.995Z]   File "/builds/worker/checkouts/gecko/tools/compare-locales/mach_commands.py", line 194, in cross_channel
[task 2021-12-24T11:08:06.995Z]     actions,
[task 2021-12-24T11:08:06.995Z]   File "/builds/worker/checkouts/gecko/third_party/python/redo/redo/__init__.py", line 185, in retry
[task 2021-12-24T11:08:06.995Z]     return action(*args, **kwargs)
[task 2021-12-24T11:08:06.995Z]   File "/builds/worker/checkouts/gecko/tools/compare-locales/mach_commands.py", line 297, in _do_create_content
[task 2021-12-24T11:08:06.995Z]     status = ccc.create_content()
[task 2021-12-24T11:08:06.995Z]   File "/builds/worker/checkouts/gecko/python/l10n/mozxchannel/__init__.py", line 103, in create_content
[task 2021-12-24T11:08:06.995Z]     with hglib.open(repo_config["path"]) as repo:
[task 2021-12-24T11:08:06.995Z]   File "/builds/worker/checkouts/gecko/third_party/python/python-hglib/hglib/__init__.py", line 11, in open
[task 2021-12-24T11:08:06.995Z]     return client.hgclient(path, encoding, configs)
[task 2021-12-24T11:08:06.995Z]   File "/builds/worker/checkouts/gecko/third_party/python/python-hglib/hglib/client.py", line 67, in __init__
[task 2021-12-24T11:08:06.995Z]     self.open()
[task 2021-12-24T11:08:06.995Z]   File "/builds/worker/checkouts/gecko/third_party/python/python-hglib/hglib/client.py", line 261, in open
[task 2021-12-24T11:08:06.995Z]     % (ret, serr.strip()))
[taskcluster 2021-12-24 11:08:07.393Z] === Task Finished ===
[taskcluster 2021-12-24 11:08:07.506Z] Artifact "public/build" not found at "/builds/worker/artifacts"
[taskcluster 2021-12-24 11:08:07.606Z] Unsuccessful task run with exit code: 1 completed in 261.352 seconds
Status: RESOLVED → REOPENED
Flags: needinfo?(aki)
Resolution: FIXED → ---
Target Milestone: 97 Branch → ---

Marked for re-landing, with a fix.

Flags: needinfo?(aki)
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/237d30dec089
nuke comm/ after cross-channel. r=mhentges,releng-reviewers,jmaher DONTBUILD
Status: REOPENED → RESOLVED
Closed: 2 years ago → 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 97 Branch