Closed Bug 1545387 Opened 5 years ago Closed 5 years ago

Intermittent Exception: Could not find any candidate pushheads in the last 50 revisions.

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P5)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, regression)

[task 2019-04-18T09:48:00.898Z] 09:48:00 INFO - The error occurred in code that was called by the mach command. This is either
[task 2019-04-18T09:48:00.898Z] 09:48:00 INFO - a bug in the called code itself or in the way that mach is calling it.
[task 2019-04-18T09:48:00.898Z] 09:48:00 INFO - You can invoke |./mach busted| to check if this issue is already on file. If it
[task 2019-04-18T09:48:00.898Z] 09:48:00 INFO - isn't, please use |./mach busted file| to report it. If |./mach busted| is
[task 2019-04-18T09:48:00.898Z] 09:48:00 INFO - misbehaving, you can also inspect the dependencies of bug 1543241.
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - If filing a bug, please include the full output of mach, including this error
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - message.
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - The details of the failure are as follows:
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - Exception: Could not find any candidate pushheads in the last 50 revisions.
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - Search started with 60ed0f34be69dd3a75fb2138caa5c5a150eea1b8, which must be known to Mozilla automation.
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - see https://developer.mozilla.org/en-US/docs/Artifact_builds
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/mach_commands.py", line 1301, in artifact_install
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - return artifacts.install_from(source, distdir or self.distdir)
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/artifacts.py", line 1333, in install_from
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - return self.install_from_recent(distdir)
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/artifacts.py", line 1254, in install_from_recent
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - return self._install_from_hg_pushheads(hg_pushheads, distdir)
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/artifacts.py", line 1233, in _install_from_hg_pushheads
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - for trees, hg_hash in hg_pushheads:
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/artifacts.py", line 1142, in _find_pushheads
[task 2019-04-18T09:48:00.899Z] 09:48:00 INFO - rev=last_revs[0], num=NUM_PUSHHEADS_TO_QUERY_PER_PARENT))
[task 2019-04-18T09:48:00.900Z] 09:48:00 INFO - Makefile:167: recipe for target 'recurse_artifact' failed
[task 2019-04-18T09:48:00.900Z] 09:48:00 ERROR - make[3]: *** [recurse_artifact] Error 1
[task 2019-04-18T09:48:00.900Z] 09:48:00 INFO - make[3]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-04-18T09:48:00.900Z] 09:48:00 INFO - /builds/worker/workspace/build/src/config/recurse.mk:32: recipe for target 'artifact' failed
[task 2019-04-18T09:48:00.900Z] 09:48:00 ERROR - make[2]: *** [artifact] Error 2
[task 2019-04-18T09:48:00.900Z] 09:48:00 INFO - /builds/worker/workspace/build/src/config/rules.mk:400: recipe for target 'default' failed
[task 2019-04-18T09:48:00.900Z] 09:48:00 ERROR - make[1]: *** [default] Error 2
[task 2019-04-18T09:48:00.900Z] 09:48:00 INFO - client.mk:125: recipe for target 'build' failed
[task 2019-04-18T09:48:00.900Z] 09:48:00 INFO - make: *** [build] Error 2
[task 2019-04-18T09:48:00.941Z] 09:48:00 INFO - 0 compiler warnings present.
[task 2019-04-18T09:48:01.018Z] 09:48:01 ERROR - Return code: 2
[task 2019-04-18T09:48:01.018Z] 09:48:01 WARNING - setting return code to 2
[task 2019-04-18T09:48:01.019Z] 09:48:01 FATAL - 'mach build -v' did not run successfully. Please check log for errors.
[task 2019-04-18T09:48:01.019Z] 09:48:01 FATAL - Running post_fatal callback...
[task 2019-04-18T09:48:01.019Z] 09:48:01 FATAL - Exiting -1

It seems to me the error is a bit further up the log. I've got these lines from two runs of the same job:

hg suggested 500 candidate revisions
hg suggested candidate revision: c91ac1a6d3adb071d5d9a92de95ffc7c3be7ad91

and

hg suggested 500 candidate revisions
hg suggested candidate revision: 305d1bc1bda9880aea7801204f3b5e27766f38f3

The code in question is a call to hg for the latest public revisions and it looks like hg is returning revisions which aren't public (ie. draft revisions on Try) for some reason.

Sounds a lot like a reincarnation of bug 1462323. Chris, do you have an idea, or whom to ask given that gps isn't with us anymore?

Flags: needinfo?(cmanchester)

Yes that's what it looks like. In the case of https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=242816890&revision=831c0dde08f3416b84b3c8289f1495f8add926da the result of that revset isn't deterministic and we succeed on the third try. This almost looks like corruption in phases data in some checkouts. Connor, does that seem plausible? Is there anything else we can do to debug this?

Flags: needinfo?(cmanchester) → needinfo?(sheehan)

(In reply to Chris Manchester (:chmanchester) from comment #6)

This almost looks like corruption in phases data in some checkouts. Connor, does that seem plausible?

Yes, that's entirely plausible and after some digging, is the exact cause of the recent spike in this bug.

I did some digging and noticed that in the failing tasks, the first two revisions returned by that revset are the revisions from the try push itself. These revisions should obviously be drafts, and thus the revset should not pick them up. In the task that succeeded, I noticed that most of the revset entries are the same, except we do not see the revisions from the push included in the list of candidate revisions.

Looking back at the Mercurial checkout in the logs, I noticed that in the passing logs, the newly added revisions from our push are marked as drafts, while in the failing logs they aren't. So something is causing the revisions to be promoted to the draft phase when transferred to the failing tasks.

Upon further inspection I also realized that the failing tasks pulled down the revisions from the private hgmo service in us-west-2, while the task that succeeded pulled down from public hgmo. I took a look at possible differences between the try repo on the hgweb hosts in MDC1 and the hgweb hosts in AWS, and found the problem: the hgrc file for the Try repo had not been replicated to the mirrors in AWS. Since the repo in AWS didn't have the hgrc with phases.publish=False, any revisions pulled down from the private hgmo service would not transfer phases data and default to public, causing those revisions to appear in the revset and compromise the search for build artifacts. (Note that the phases data on the try repos in AWS is correct, only transferring revisions from those hosts via pull would cause phase promotion). Below is a set of commands I used to confirm this behaviour:

~ $ mkdir test-phases-serving
~ $ cd test-phases-serving 
~/test-phases-serving $ hg init test-draft-nonpublish 
~/test-phases-serving $ cd test-draft-nonpublish 
~/test-phases-serving/test-draft-nonpublish $ echo f > a
~/test-phases-serving/test-draft-nonpublish $ hg addremove
adding a
~/test-phases-serving/test-draft-nonpublish $ hg di
~/test-phases-serving/test-draft-nonpublish $ hg commit -m first
~/test-phases-serving/test-draft-nonpublish $ echo x > b
~/test-phases-serving/test-draft-nonpublish $ hg addremove
adding b
~/test-phases-serving/test-draft-nonpublish $ hg commit -m second
~/test-phases-serving/test-draft-nonpublish $ hg log -T '{phase}\n' 
draft
draft
~/test-phases-serving/test-draft-nonpublish $ cd ..
~/test-phases-serving $ hg init test-pull
~/test-phases-serving $ cd test-pull
~/test-phases-serving/test-pull $ hg pull ../test-draft-nonpublish 
pulling from ../test-draft-nonpublish
requesting all changes
adding changesets
adding manifests
adding file changes
added 2 changesets with 2 changes to 2 files
new changesets 9ff07c2c03c4:c94add2d7cd3
(run 'hg update' to get a working copy)
~/test-phases-serving/test-pull $ # note the new changesets are not drafts!
~/test-phases-serving/test-pull $ hg log -T '{phase}\n'
public
public
~/test-phases-serving/test-pull $ cd ..
~/test-phases-serving $ rm -rf test-pull
~/test-phases-serving $ hg init test-pull
~/test-phases-serving $ cd test-draft-nonpublish
~/test-phases-serving/test-draft-nonpublish $ hg config -e -l  # Set `phases.publish=False`
~/test-phases-serving/test-draft-nonpublish $ cd ../test-pull 
~/test-phases-serving/test-pull $ hg pull ../test-draft-nonpublish
pulling from ../test-draft-nonpublish
requesting all changes
adding changesets
adding manifests
adding file changes
added 2 changesets with 2 changes to 2 files
new changesets 9ff07c2c03c4:c94add2d7cd3 (2 drafts)
(run 'hg update' to get a working copy)
~/test-phases-serving/test-pull $ hg log -T '{phase}\n'
draft
draft

The reason the hgrc file was not replicated to the AWS hosts is because the Try repo is special cased during the hgweb bootstrap procedure. The replication system uses hg pull for all data transfers, and performing an hg pull of the try repo over SSH would take upwards of 24 hours for a full transfer. To work around this, we instead perform a streamed clone of the try repo as a special exception during bootstrap. Unlike the hg pull based replication, streamed clones do not transfer the underlying hgrc file to the new repo.

Adding a call to hg replicatehgrc to the bootstrap procedure after streaming the try repo would fix this bug moving forward. For the moment, I've manually triggered replicationo of the hgrc file to the AWS hosts in us-west-2, so changesets transferred from Try on the private hg service should now be in the draft phase. I'll send a patch to fix the bootstrap procedure so it does this automatically.

Flags: needinfo?(sheehan)

Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/f98ead5d399d
ansible/bootstrap: run hg replicatehgrc on try repos after running a stream clone

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.