Closed Bug 574348 Opened 14 years ago Closed 14 years ago

Occasional hg corruption (abort: foo.i@XXXXXXXXXX: no match found)

Categories

(mozilla.org Graveyard :: Server Operations, task)

All
Other
task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: aravind)


Several times a week we hit an error with hg such as:
abort: data/talos-r3/master1.cfg.i@6f63c247e067: no match found!

This happens when pulling / updating from hg.mozilla.org, and seems to happen at the same time when we're landing changes to the repo.

It requires the slave to be manually cleaned up.
Is it always with the same repo?
Assignee: server-ops → aravind
No, I've seen it on build/buildbot-configs, build/tools and mozilla-central at least.
The Google hits I get suggest checking the integrity of the repos in question. Since a subsequent clone works fine, I am assuming the server repo is fine. Next time you hit this issue, can you run "hg verify" on the repo?

Are these problems limited to any particular OS?
I think I've hit an incarnation of this bug last night on the l10n dashboard.

Here's what my code does: it asks json-pushes for the changesets it has, runs an hg pull, and then asks the local repo for further details. Last night the local clone was missing a specific revision. Running the code again worked fine, which leads me to suspect that the repo behind the pushes hook had newer data than the one I pulled from.
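A minimal sketch of the workaround implied above (pull, check that the expected revisions arrived, and pull again if they did not). It is not the dashboard's actual code: the names pull_until_present and run_hg, the retry count, and the delay are all hypothetical. It relies only on "hg pull" and on "hg log -r REV" exiting non-zero for an unknown revision.

```python
import subprocess
import time


def run_hg(args, cwd):
    """Run an hg command in `cwd`; return True when it exits with status 0."""
    result = subprocess.run(["hg", *args], cwd=cwd, capture_output=True)
    return result.returncode == 0


def pull_until_present(repo, revs, run=run_hg, attempts=3, delay=5.0):
    """Pull, then check that every expected revision exists locally.

    Hypothetical helper: returns the revisions still missing after the
    last attempt (an empty list means everything arrived).
    """
    missing = list(revs)
    for attempt in range(attempts):
        run(["pull"], repo)
        # `hg log -r REV` fails when REV is unknown to the local repo.
        missing = [
            rev for rev in missing
            if not run(["log", "-r", rev, "--template", "{node}"], repo)
        ]
        if not missing:
            break
        if attempt + 1 < attempts:
            time.sleep(delay)  # give the server side time to catch up
    return missing
```

The `run` parameter exists only so the loop can be exercised without a real repository. A retry like this hides the symptom on the client; it does not address whatever leaves the servers out of step.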

Speculation on irc yesterday was that if you happen to switch from one server to the other during a pull, you could end up with file manifests for different versions.

Sorry, didn't find the bug early enough to catch aravind's last request to run an hg verify, will do on the next occasion.
talos-r3-snow-016:tools cltbld$ hg verify
checking changesets
checking manifests
crosschecking files in changesets and manifests
checking files
 buildfarm/utils/generate-tpcomponent.py@654: d8515fbec4b4 in manifests not found
542 files, 655 changesets, 1521 total revisions
1 integrity errors encountered!
(first damaged changeset appears to be 654)
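For anyone who wants to collect these reports automatically from the slaves, the verify output above can be picked apart with a few regular expressions. This is only a sketch: parse_verify is a hypothetical helper, and it recognises just the three message shapes that appear in this bug, not every message "hg verify" can print.

```python
import re

MISSING = re.compile(r"(.+)@(\d+): ([0-9a-f]+) in manifests not found")
SUMMARY = re.compile(r"(\d+) integrity errors encountered!")
FIRST = re.compile(r"\(first damaged changeset appears to be (\d+)\)")


def parse_verify(output):
    """Extract the integrity findings from captured `hg verify` output."""
    report = {"errors": 0, "first_damaged": None, "missing": []}
    for raw in output.splitlines():
        line = raw.strip()
        match = MISSING.fullmatch(line)
        if match:
            # (file path, local changeset number, file node named by the manifest)
            report["missing"].append(
                (match.group(1), int(match.group(2)), match.group(3))
            )
            continue
        match = SUMMARY.fullmatch(line)
        if match:
            report["errors"] = int(match.group(1))
            continue
        match = FIRST.fullmatch(line)
        if match:
            report["first_damaged"] = int(match.group(1))
    return report
```

Fed the output pasted above, the report names buildfarm/utils/generate-tpcomponent.py at changeset 654 with node d8515fbec4b4.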
The clone started at 06:38:26 and failed at 06:39:15.  A change was pushed at 06:38:40.  The revision 'd8515fbec4b4' doesn't exist in the repo AFAICT.
Any chance to get a tarball of that failed clone attached to the bug so that we can debug locally?

PS: Likely the push of the 3.6.6 release config update.
@djc: any ideas here?  I am kind of lost.

catlee says it happens when the repo had a recent push.  We do see some corruption on the local repo when this happens (see comment 6).
Every single one of the builds on my most recent push is burning, like:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1279029231.1279029324.9414.gz
abort: data/thunderbird/l10n-thunderbird-changesets-3.1.i@add4593f45ee: no match found!
(cloning buildbot-configs)
@djc: What can I provide for hg folks to troubleshoot this? We are having to turn off caching to cope with the problem, and even that doesn't make it go away. We still see this problem, only now it's ephemeral.
17:47 < mpm> It smells like rsync or hook weirdness.
17:50 < mpm> Can we find out what versions they're running and how their servers interact?
17:52 < mpm> There are 3 ways this can happen:
17:52 < mpm> a) client has corruption and pushes it to server
17:53 < mpm> b) rsync or similar updates manifest before files
17:53 < mpm> c) rollback (ie due to failing hook) during pull with older hg
17:54 < mpm> Recent hg has a config option to check for (a) too.
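To make case (b) concrete, here is a toy model of a copy that updates the manifest before the file data. Nothing here is Mercurial code: Store and copy_manifest_first are hypothetical stand-ins, the node names are made up, and the error text merely mimics the one in this bug. The point is only that a reader arriving between the two steps finds a manifest entry whose file revision does not exist yet.

```python
class Store:
    """Toy stand-in for a repo store: one manifest map plus per-file logs."""

    def __init__(self):
        self.manifest = {}  # changeset -> {path: file node}
        self.filelogs = {}  # path -> set of file nodes present on disk

    def read(self, changeset):
        """Resolve every file in a changeset, as an update/checkout would."""
        files = self.manifest[changeset]
        for path, node in files.items():
            if node not in self.filelogs.get(path, set()):
                raise LookupError(f"data/{path}.i@{node}: no match found!")
        return dict(files)


def copy_manifest_first(src, dst):
    """Copy src into dst in the wrong order, pausing between the steps.

    Written as a generator so a caller can read dst at the midpoint.
    """
    dst.manifest = {cs: dict(files) for cs, files in src.manifest.items()}
    yield "manifest copied"
    dst.filelogs = {path: set(nodes) for path, nodes in src.filelogs.items()}
    yield "file data copied"
```

Calling next() once on the generator and then dst.read() on the newest changeset raises the LookupError; after the second next() the same read succeeds, which matches the observation that a later clone works fine.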
The repo on the server isn't corrupted, 'hg verify' shows that, so I think that rules out (a). Our server is running hg 1.5.4 (bug 551015), so I don't think it's (c) (I know we've seen that in the past).

I'm not sure how data gets from the backend servers to the webheads, so (b) seems plausible. Aravind?
mpm (Mercurial dev) suggested adding the sync,noac options to the NFS mount options. We did that and re-enabled caching on the buildbot-configs repos. Please comment here if you notice these problems again.
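One way to confirm the new options are actually in effect on a webhead is to read the mount table. The sketch below assumes a Linux-style /proc/mounts layout (device, mount point, type, comma-separated options); missing_mount_options is a hypothetical helper, and whether the real servers expose their mounts this way is an assumption.

```python
def missing_mount_options(mounts_text, mount_point, required=("sync", "noac")):
    """Return the required options that are absent for `mount_point`.

    `mounts_text` is the content of a /proc/mounts-style table.
    Raises ValueError when the mount point is not listed at all.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = set(fields[3].split(","))
            return [opt for opt in required if opt not in options]
    raise ValueError(f"{mount_point} is not mounted")
```

With the real table this would be called as missing_mount_options(open("/proc/mounts").read(), "/path/to/repos"), where the path is a placeholder for wherever the repos are mounted; an empty list means both options are present.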
We're hosting hg repos on NFS?  That just sounds like fail waiting to happen...
Aravind discussed this with mpm, and he said NFS should work fine, modulo some write ordering issues. He suggested adding sync to the NFS options and had confidence that it would fix the problem.
Please re-open if this continues to happen.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
This has reoccurred:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1280421897.1280422060.4525.gz
updating working directory
abort: data/thunderbird/l10nbuilds.ini.i@5d6f213899b2: no match found!

(on buildbot-configs, of course)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
On a fresh pull of buildbot-configs on my local machine:

mark-banners-macbook:~ mark$ hg clone http://hg.mozilla.org/build/buildbot-configs
destination directory: buildbot-configs
requesting all changes
adding changesets
adding manifests
adding file changes
added 2770 changesets with 6447 changes to 1357 files
updating to branch default
abort: data/thunderbird/l10nbuilds.ini.i@5d6f213899b2: no match found!

Something needs poking at the hg end I believe.
Severity: minor → blocker
Had a corrupt cache; clearing that fixed it.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard