Closed Bug 969689 Opened 7 years ago Closed 7 years ago

Do our own hg purge

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ehsan.akhgari, Unassigned)

References

Details

Attachments

(2 files, 2 obsolete files)

For try builds we have tried to use hg purge to avoid having to repopulate the entire working directory every single time we build, but we have failed because of what seems like a dirstate corruption caused by hg purge.  If we wrote our own bug free tool which ran hg status, and then deleted the stuff that hg doesn't know about manually (using winrm on windows) then we would potentially be able to use that tool to purge our checkout and just hg update to the new revision we want to build.
All the built-in hg purge extension does is run a status, and then delete files that are unknown/ignored. I suspect what's going on in production is .hg/dirstate corruption....perhaps if we remove that prior to updating/purging we'll be ok?
(In reply to comment #1)
> All the built-in hg purge extension does is run a status, and then delete files
> that are unknown/ignored. I suspect what's going on in production is
> .hg/dirstate corruption....perhaps if we remove that prior to updating/purging
> we'll be ok?

I think relying on that working as expected is really dangerous.
http://selenic.com/hg/file/b0638b5b004d/hgext/purge.py looks pretty simple and straight forward... The issue may be coming from hg share.
Arguably, if hg purge has bugs, we should fix them instead of reinventing a new thing that might work better. With that being said, the only problem i know we had with hg purge on try was likely due to the combination of the use of hg share and a try reset.
I've never encountered an issue with purge.

Totally agree with glandium that we should fix hg instead of working around this.

We can start by upgrading Mercurial to 2.8 or 2.9 to help rule out a bug in older Mercurial clients. Bug 961279 tracks.

Before we start blindly deleting .hg/dirstate, perhaps we should run `hg summary` first to see if there is anything weird going on in the working copy before apparent corruption is detected.

As for hg share, please confirm we're not using mq, rebase, graft, or any other history-rewriting command in automation. If we are, shared repos become corrupt very easily. (This behavior is clearly documented in the output of `hg help share`.)
(In reply to comment #3)
> http://selenic.com/hg/file/b0638b5b004d/hgext/purge.py looks pretty simple and
> straight forward... The issue may be coming from hg share.

Well, I think at least this part of this code is wrong on Windows, unless there is some Python magic involved which I'm unaware of: <http://selenic.com/hg/file/b0638b5b004d/hgext/purge.py#l90>

On Windows, you will fail to delete files for many other reasons.  The most common one is that the file is open in exclusive mode by some other program.
(In reply to comment #4)
> Arguably, if hg purge has bugs, we should fix them instead of reinventing a new
> thing that might work better.

Sure, but that doesn't need to block us from doing something better to address our infrastructure load issues.

> With that being said, the only problem i know we
> had with hg purge on try was likely due to the combination of the use of hg
> share and a try reset.

I think this might be what catlee and rail were talking about.  (This bug came out of my conversation with them in person on Friday evening, sorry it has a lot of details missing I suppose.)
I think the magic is http://selenic.com/hg/file/b0638b5b004d/hgext/purge.py#l96 -- this part retries to delete the same file after trying to fix the attributes. If the second call fails, it just raises an exception and exits with an error.
(In reply to comment #8)
> I think the magic is http://selenic.com/hg/file/b0638b5b004d/hgext/purge.py#l96
> -- this part retries to delete the same file after trying to fix the
> attributes. If the second call fails, it just raises an exception and exits
> with an error.

Oh right.  I guess that's sensible...
(In reply to Rail Aliiev [:rail] from comment #8)
> I think the magic is
> http://selenic.com/hg/file/b0638b5b004d/hgext/purge.py#l96 -- this part
> retries to delete the same file after trying to fix the attributes. If the
> second call fails, it just raises an exception and exits with an error.

It prints a warning on failure and only exits on error if invoked with --abort-on-err. We should be using --abort-on-err in automation.
Can we just upgrade Mercurial already? 2.6+ uses multiple I/O threads to perform `hg update`, which makes the operation faster, especially on SSDs.

https://tbpl.mozilla.org/php/getParsedLog.php?id=34354966&tree=Mozilla-Inbound&full=1 (a clobber) shows we take 188s to perform an hg up.

On my machine:

$ hg up -C null (makes the working copy bare)

$ time hg up central/default
90304 files updated, 0 files merged, 0 files removed, 0 files unresolved

real	0m13.977s
user	0m29.616s
sys	0m22.133s

$ hg up -C null
$ time hg --config worker.numcpus=1 up central/default
90304 files updated, 0 files merged, 0 files removed, 0 files unresolved

real	0m51.397s
user	0m22.227s
sys	0m16.079s

37.5s wall time savings. Not too shabby.

Newer Mercurial clients will be faster and they may even squash this bug. What are we waiting for?
(In reply to Gregory Szorc [:gps] from comment #11)
> Newer Mercurial clients will be faster and they may even squash this bug.
> What are we waiting for?

Can we reproduce the initial error and tell for sure that upgrading resolves the issue? If yes, I volunteer driving the upgrade. :)
(In reply to comment #12)
> (In reply to Gregory Szorc [:gps] from comment #11)
> > Newer Mercurial clients will be faster and they may even squash this bug.
> > What are we waiting for?
> 
> Can we reproduce the initial error and tell for sure that upgrading resolves
> the issue? If yes, I volunteer driving the upgrade. :)

I think the answer to both of those questions is no.  Upgrading Mercurial is orthogonal here.  If we come up with a way of purging our source tree which doesn't break, hg update needs to touch fewer files and will therefore be faster.
We can't rule out that a bug in 2.5.4 is the source of this problem. It should be very low risk and require little effort (at least on some builders like the EC2 Linux slaves) to upgrade Mercurial.

Here's an idea: let's isolate the variable. We know from bug 883918 that purge had issues on both Linux and OS X. Let's upgrade Mercurial on Linux slaves to 2.8 or 2.9 and see if Linux still has issues with purge. If not, we have our solution. If so, we're almost certainly looking at a Mercurial bug in tip or with a way Mozilla is using Mercurial. Either way, we've eliminated one unknown (Mercurial 2.5.4 bug) from the equation.
How about reproducing locally the bug we hit then testing it with the upgrade?  I personally don't want each of our slaves on different hg versions again,  Not do I see a compelling reason to upgrade at this stage. Lastly I'm let of turning on purge when it wasted lots of cycles and weird issues on try and other trees due to this bug, at least until we can reproduce the issue and see that it's fixed in a bee version. I'd we can prove it's fixed that's a great reason for upgrade imo
(In reply to Gregory Szorc [:gps] from comment #14)
> Here's an idea: let's isolate the variable. We know from bug 883918 that
> purge had issues on both Linux and OS X. Let's upgrade Mercurial on Linux
> slaves to 2.8 or 2.9 and see if Linux still has issues with purge. If not,
> we have our solution. If so, we're almost certainly looking at a Mercurial
> bug in tip or with a way Mozilla is using Mercurial. Either way, we've
> eliminated one unknown (Mercurial 2.5.4 bug) from the equation.

That would prove nothing if, like i suspect, the whole problem doesn't appear until a try reset.
I don't recall why we think this would be correlated to a try reset. Why would resetting try cause this behaviour?
I'd really rather not test this in production, it means we're not always building what we think we're building. In addition to explicit build failures, how do we know we're not getting cases where the build succeeds but with the wrong source on disk?

Is there something else we can do to validate that a checkout matches the revision specified?
(In reply to comment #18)
> Is there something else we can do to validate that a checkout matches the
> revision specified?

Not rely on hg purge?  Like I said in our discussion on Friday, we're implicitly trusting that hg status and hg update do what they are supposed to do on other branches already, so verifying that here doesn't seem necessary.  The only reason why we know that we could not trust hg status in the past has been our usage of hg purge (where the bug is in Mercurial doesn't matter).  Therefore, it seems to me like if we used another tool besides hg purge to do the purge then we would be in no worse spot with regards to trusting what is inside a checkout compared to the non-try branches.
(In reply to Chris AtLee [:catlee] from comment #17)
> I don't recall why we think this would be correlated to a try reset. Why
> would resetting try cause this behaviour?

The bug you pointed to me last week, which i don't remember the number, and which is about the only one i know that has some log parts in it, showed a message that in essence hg doesn't know about the changeset that's currently checked out. That smells a lot like something you'd get on try after a reset, where the hg share would have a fresh try with none of the previous try changesets, and a source tree with an old try changeset checkout.
OK, I'm looking at the build logs and the automation code for Mercurial. I see the following problems:

1) We have separate shared clones for each repo. e.g. /builds/hg-shared/mozilla-central and /builds/hg-shared/mozilla-inbound. Each repository chains up to the same root changeset, so there's no reason we can't pull all changesets into the same repository. This is wasting disk space and I/O on each slave. This is wasting network bandwidth due to excessive pulling of the same changesets. This is wasting CPU on hg.mozilla.org. If we do this, we'll get better page cache hit rates, free up disk space, reduce network load, and decrease CPU everywhere. It's wins all around. This issue is orthogonal to the one at hand.

2) I see code paths in the build-tools repo where the "source" shared repo may be deleted. If the source repo is deleted and *all* shared/derived repos aren't also deleted, I believe there is a high possibility for corruption. It's possible that our corruption was caused by deleting one clone but not all of them. I may submit a patch to Mercurial to detect this scenario.
I don't see how resetting try would be an issue *unless* we also deleted the source of the shared repo on clients as well. Since the reset try repo will chain to the same root changeset as the old try repo, clients should still be able to pull new heads from the new try repo. My mirror at https://hg.stage.mozaws.net/mirrors/generaldelta/try contains heads from old and new try repos and the mirroring code worked seamlessly as the try repo was reset.
(In reply to Gregory Szorc [:gps] from comment #22)
> I don't see how resetting try would be an issue *unless* we also deleted the
> source of the shared repo on clients as well.

Yes, that's what i'm saying might have happened. Considering your 2) in comment 21, I'm now pretty sure that's what happened.
(In reply to Gregory Szorc [:gps] from comment #21)
> OK, I'm looking at the build logs and the automation code for Mercurial. I
> see the following problems:
> 
> 1) We have separate shared clones for each repo. e.g.
> /builds/hg-shared/mozilla-central and /builds/hg-shared/mozilla-inbound.
> Each repository chains up to the same root changeset, so there's no reason
> we can't pull all changesets into the same repository. This is wasting disk
> space and I/O on each slave. This is wasting network bandwidth due to
> excessive pulling of the same changesets. This is wasting CPU on
> hg.mozilla.org. If we do this, we'll get better page cache hit rates, free
> up disk space, reduce network load, and decrease CPU everywhere. It's wins
> all around. This issue is orthogonal to the one at hand.

Yes, we should definitely do this. One challenge with implementing this is to identify which remotes are related to each other. We use hgtool and /builds/hg-shared for other repositories, not just gecko. Another issue is to identify any places where we're not specifying a revision for builds. 'tip' is no longer meaningful.

In any case, this doesn't apply to Try, where we're always pulling in the same repository.

> 2) I see code paths in the build-tools repo where the "source" shared repo
> may be deleted. If the source repo is deleted and *all* shared/derived repos
> aren't also deleted, I believe there is a high possibility for corruption.
> It's possible that our corruption was caused by deleting one clone but not
> all of them. I may submit a patch to Mercurial to detect this scenario.

So that sounds like a plausible theory...but wouldn't 'hg status' complain loudly about the state of the working directory in that case rather than claiming everything was ok?
(In reply to Chris AtLee [:catlee] from comment #24)
> > 2) I see code paths in the build-tools repo where the "source" shared repo
> > may be deleted. If the source repo is deleted and *all* shared/derived repos
> > aren't also deleted, I believe there is a high possibility for corruption.
> > It's possible that our corruption was caused by deleting one clone but not
> > all of them. I may submit a patch to Mercurial to detect this scenario.
> 
> So that sounds like a plausible theory...but wouldn't 'hg status' complain
> loudly about the state of the working directory in that case rather than
> claiming everything was ok?

The share extension essentially installs a "software symlink" that tells the repo "the .hg/store directory is at X." The only validation performed when opening the shared repo is whether the shared store exists. Now, Mercurial core will do some additional validation. e.g. it validates that the parent of the working directory is in the store.

I think we're falling through a crack where the destination repo's local state (including but not limited to dirstate) is getting out of sync with the shared source due to the shared source changing.

I performed these steps:

1) hg share firefox firefox-shared
2) Edit firefox-shared/.hg/sharedpath to point to an earlier clone of the original repo

$ hg st
warning: ignoring unknown working parent d8d8fa98ee7d!

$ hg up 551290ba9d3b
warning: ignoring unknown working parent d8d8fa98ee7d!
updating abort: data/mobile/android/base/GeckoAppShell.java.i@21f254a10c3a: no match found!

$ hg st
warning: ignoring unknown working parent d8d8fa98ee7d!
M .clang-format-ignore
M .gdbinit
M .gitignore
M .hgignore
M .hgtags
M .lldbinit
...

What's happening here is that Mercurial is getting confused because the shared source lost the parent of the working directory.

Now, to get to this state, I explicitly ignored the giant warning in `hg help share` about rewriting repo history. It's quite possible our automation is doing something similar.

My advice: if we delete the shared repo source, we need to delete all clones.

We could try deleting dirstate. But what about the files in .hg/cache? What if Mercurial adds new files into the .hg directory? Rule of thumb: if you are touching .hg manually, you are doing it wrong. Automation shouldn't make assumptions about Mercurial's operation outside of the `hg` command line interface. Please don't muck about inside .hg and just delete all derived clones when the shared source is deleted.
(In reply to Gregory Szorc [:gps] from comment #25)
> (In reply to Chris AtLee [:catlee] from comment #24)
> > So that sounds like a plausible theory...but wouldn't 'hg status' complain
> > loudly about the state of the working directory in that case rather than
> > claiming everything was ok?

In testing locally, it seems 'hg status' only complains until you 'hg up' to another version. The 'hg up' will issue a warning, but after that 'hg status' will have the strange behavior without issuing any warnings.

Using gps' method, I was also able to get a repository where "hg status" doesn't show a file as modified, but the file doesn't actually exist in the current revision that is checked out:

$ hg status
$ hg log new5.txt
$ ls
hey.txt  new5.txt  new.txt

This happened using hg 2.5.4, 2.6.3 (my Ubuntu version), and 2.9, so I don't think updating our hg version will help for this particular issue.

I agree we'll need to delete the clones if we delete the shared repo. Can we just grep through */.hg/sharedpath to look for the repo we're removing to find the clones?
Attached file test-hg.sh
Here's a script to reproduce the particular errant behavior in #c26. For some reason, the behavior goes away if some of the sleep 1's are removed (I'm not sure which ones are actually necessary :).
Attachment #8374982 - Attachment mime type: application/x-shellscript → text/plain
Yes, grepping */.hg/sharedpath should be sufficient to detect clones.

But keep in mind that anything in .hg is an implementation detail of Mercurial and could change at any time. Relying on things in .hg thus makes the barrier to upgrading clients harder because it will require more testing.

If you consult .hg/sharedpath, I highly recommend adding automated tests for this scenario so future hg upgrades won't break.
Awesome, great work Mike!

So...how do we tell if the try repo has been reset from the build machine's POV?
(In reply to Chris AtLee [:catlee] from comment #29)
> Awesome, great work Mike!
> 
> So...how do we tell if the try repo has been reset from the build machine's
> POV?

A Try repo reset shouldn't impact existing clones one bit as long as the old and new try repos both chain up to the same initial commit (rev 0).

The suspected problem has everything to do with our usage of the share feature.
(In reply to Michael Shal [:mshal] from comment #27)
> Created attachment 8374982 [details]
> test-hg.sh
> 
> Here's a script to reproduce the particular errant behavior in #c26. For
> some reason, the behavior goes away if some of the sleep 1's are removed
> (I'm not sure which ones are actually necessary :).

If you run `hg strip` with `hg share`, you are going to have problems. This script doesn't really prove anything.

---

$ hg help share
hg share [-U] SOURCE [DEST]

create a new shared repository

    Initialize a new repository and working directory that shares its history
    with another repository.

    Note:
       using rollback or extensions that destroy/modify history (mq, rebase,
       etc.) can cause considerable confusion with shared clones. In
       particular, if two shared clones are both updated to the same
       changeset, and one of them destroys that changeset with rollback, the
       other clone will suddenly stop working: all operations will fail with
       "abort: working directory has unknown parent". The only known
       workaround is to use debugsetparents on the broken clone to reset it to
       a changeset that still exists.
(In reply to Gregory Szorc [:gps] from comment #31)
> (In reply to Michael Shal [:mshal] from comment #27)
> > Created attachment 8374982 [details]
> > test-hg.sh
> > 
> > Here's a script to reproduce the particular errant behavior in #c26. For
> > some reason, the behavior goes away if some of the sleep 1's are removed
> > (I'm not sure which ones are actually necessary :).
> 
> If you run `hg strip` with `hg share`, you are going to have problems. This
> script doesn't really prove anything.

Right - though we don't explicitly 'hg strip' in automation, we do effectively the same thing by deleting the main repo and replacing it with possibly a subset of the repo that doesn't have the cset that the shared repo is actively using.

The script I attached was trying to show how what you described in #c25 can cause the behavior seen in bug 883918. Specifically that there is a file on disk that 'hg status' reports nothing for, 'hg purge' doesn't remove, and isn't actually in any changesets in the repository history.

> 
> ---
> 
> $ hg help share
> hg share [-U] SOURCE [DEST]
> 
> create a new shared repository
> 
>     Initialize a new repository and working directory that shares its history
>     with another repository.
> 
>     Note:
>        using rollback or extensions that destroy/modify history (mq, rebase,
>        etc.) can cause considerable confusion with shared clones. In
>        particular, if two shared clones are both updated to the same
>        changeset, and one of them destroys that changeset with rollback, the
>        other clone will suddenly stop working: all operations will fail with
>        "abort: working directory has unknown parent". The only known
>        workaround is to use debugsetparents on the broken clone to reset it
> to
>        a changeset that still exists.

This isn't what we actually see, though - all operations do not fail with "abort: working directory has unknown parent". All we see is a "warning: ignoring unknown parent xyz" message when "hg up <commit>" successfully updates from an invalid changeset to a valid changeset. After this, all hg operations appear to work without any warnings/errors, except that files from the invalid changeset linger in the working directory and aren't deleted by 'hg purge'.

Since the info from the 'hg purge' issue is scattered across several bugs, I think it would be useful to summarize:

1) We absolutely should update 'hg' in automation, for the performance reasons gps cited in #c11
2) Updating hg will not fix the 'hg purge' issue, so we don't need to experiment with that
3) Writing our own 'hg purge' using 'hg status' won't work, since 'hg status' is also affected when the cset disappears in the shared repo.
4) We need to detect when a shared clone is made invalid by clobbering to avoid getting our history rewritten out from under us. Then we should be able to turn 'hg purge' back on and monitor it carefully.
Attached patch shared (obsolete) — Splinter Review
catlee suggested using 'hg parent' to see if our shared repo was modified out from under us. If 'hg parent' returns nothing, we can clobber ourselves. I think the test case mimics the error in bug 883918.
Attachment #8375857 - Flags: feedback?(bhearsum)
I started a thread on the mercurial-devel list about building detection of broken shares into Mercurial. With any luck, I'll write a patch and land it in time for Mercurial 3.0, which is scheduled for release March 1.

http://www.selenic.com/pipermail/mercurial-devel/2014-February/056259.html
Comment on attachment 8375857 [details] [diff] [review]
shared

Review of attachment 8375857 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/python/mozilla_buildtools/test/test_util_hg.py
@@ +274,5 @@
> +        # Make the working repo have a new file
> +        run_cmd(['touch', 'newfile'], cwd=self.wc)
> +        run_cmd(['hg', 'add', 'newfile'], cwd=self.wc)
> +        # Sleeping here is necessary to trigger the odd behavior in
> +        # the hg share

Can you expand on why we don't hit the issue with the sleep? I see it referenced in this bug, but no real explanation.

::: lib/python/util/hg.py
@@ +433,5 @@
>                  log.info("We're currently shared from %s, but are being requested to pull from %s (%s); clobbering",
>                           dest_sharedPath_data, repo, norm_sharedRepo)
>                  remove_path(dest)
>  
> +            parent = get_revision(dest)

Can you add a brief comment here, and a link to this bug?
Component: Release Automation → General Automation
QA Contact: bhearsum → catlee
(In reply to Ben Hearsum [:bhearsum] from comment #35)
> Comment on attachment 8375857 [details] [diff] [review]
> shared
> 
> Review of attachment 8375857 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: lib/python/mozilla_buildtools/test/test_util_hg.py
> @@ +274,5 @@
> > +        # Make the working repo have a new file
> > +        run_cmd(['touch', 'newfile'], cwd=self.wc)
> > +        run_cmd(['hg', 'add', 'newfile'], cwd=self.wc)
> > +        # Sleeping here is necessary to trigger the odd behavior in
> > +        # the hg share
> 
> Can you expand on why we don't hit the issue with the sleep? I see it
> referenced in this bug, but no real explanation.

Unfortunately I'm really not sure why the sleep changes the behavior. I only found it was necessary when I tried to get a reproducible script together and found that the script "worked" whereas the same commands typed manually by me showed the failure case.

> 
> ::: lib/python/util/hg.py
> @@ +433,5 @@
> >                  log.info("We're currently shared from %s, but are being requested to pull from %s (%s); clobbering",
> >                           dest_sharedPath_data, repo, norm_sharedRepo)
> >                  remove_path(dest)
> >  
> > +            parent = get_revision(dest)
> 
> Can you add a brief comment here, and a link to this bug?

Sure, new patch on the way...
Attached patch shared (obsolete) — Splinter Review
Attachment #8376405 - Flags: review?(bhearsum)
Comment on attachment 8376405 [details] [diff] [review]
shared

Review of attachment 8376405 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/python/util/hg.py
@@ +440,5 @@
> +            # 'hg status', 'hg purge', and the like to do incorrect things. If
> +            # we detect this situation, then it's best to clobber and re-create
> +            # dest.
> +            parent = get_revision(dest)
> +            if not parent:

I'm not 100% confident this will catch all instances where the source repo was replaced because I'm not sure how some of the low-level caching in Mercurial works.

We can try this. But no guarantees.
(In reply to Gregory Szorc [:gps] from comment #38)
> I'm not 100% confident this will catch all instances where the source repo
> was replaced because I'm not sure how some of the low-level caching in
> Mercurial works.
> 
> We can try this. But no guarantees.

Same here. We'll definitely have to monitor it closely after we turn purge back on.
Comment on attachment 8375857 [details] [diff] [review]
shared

Forgot to mark this obsolete
Attachment #8375857 - Attachment is obsolete: true
Attachment #8375857 - Flags: feedback?(bhearsum)
Attachment #8376405 - Flags: review?(bhearsum) → review+
Attached patch sharedSplinter Review
Minor changes from chat with catlee:
 - Fixed the log message to have a argument for '%s'
 - Replaced the time.sleep(1) with a 'touch -d yesterday <file>'.

r+ carried forward.
Attachment #8376405 - Attachment is obsolete: true
Attachment #8377617 - Flags: review+
I updated the standalone version in puppet too:
https://hg.mozilla.org/build/puppet/rev/b04c04ae7005

That needs merging to production before it will be useful.
Looks like it's working!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
This worked fine over a try reset too (checked windows and linux).
Depends on: 976436
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.