Closed Bug 883918 Opened 11 years ago Closed 10 years ago

"hg purge" causing intermittent problems on linux+osx

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: Gavin, Unassigned)

Attachments

(1 file)

The push: https://tbpl.mozilla.org/?tree=Try&rev=d8c746ad3326
The build failure: https://tbpl.mozilla.org/php/getParsedLog.php?id=24157495&tree=Try&full=1

TEST-UNEXPECTED-FAIL | xpccheck | test test_webappsActor.js is missing from test manifest /builds/slave/try-osx64-00000000000000000000/build/dom/apps/tests/unit/xpcshell.ini!

My push didn't touch test_webappsActor.js, so this suggests that somehow that Try build on bld-lion-r5-027 wasn't properly clobbered before the build started.

No idea if this was just a one-off problem.
We don't clobber on try since the try bits of bug 851270 landed. We now do 'hg purge' instead of a clobber. Perhaps this is another occurrence of bug 873067.
Similar to the other bug, the file exists on disk, and hg thinks that's just fine:

bld-lion-r5-027:build cltbld$ hg ident
1066d9fca2ee
bld-lion-r5-027:build cltbld$ hg status ./dom/apps/tests/unit/test_webappsActor.js
bld-lion-r5-027:build cltbld$ hg log !$
hg log ./dom/apps/tests/unit/test_webappsActor.js
changeset:   130931:c50f597b1e6a
user:        Alexandre Poirot <poirot.alex@gmail.com>
date:        Mon May 06 09:51:53 2013 -0400
summary:     Bug 844227 - Add more functions to the webapps actor. r=fabrice

However, http://hg.mozilla.org/try/file/1066d9fca2ee/dom/apps/tests/unit doesn't list that file, nor does http://hg.mozilla.org/try/file/d8c746ad3326/dom/apps/tests/unit.
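
The cross-check above (file present on disk and reported clean, yet absent from the revision's file listing) could be scripted roughly as follows. This is only a sketch: `manifest_paths` and `missing_from_manifest` are hypothetical helper names, and it assumes `hg` is on the PATH and that `hg manifest -r REV` prints one repository-relative path per line.

```python
import subprocess


def manifest_paths(repo, rev="."):
    """Return the set of paths recorded in the manifest of `rev`.

    Hypothetical helper: shells out to `hg manifest -r REV` inside `repo`.
    """
    out = subprocess.check_output(["hg", "manifest", "-r", rev], cwd=repo)
    return set(out.decode("utf-8", "replace").splitlines())


def missing_from_manifest(paths, manifest):
    """Return the given paths that the manifest does not contain, sorted.

    A leading "./" (as typed in the shell session above) is stripped so the
    paths compare equal to the repository-relative names in the manifest.
    """
    normalized = (p[2:] if p.startswith("./") else p for p in paths)
    return sorted(p for p in normalized if p not in manifest)
```

Calling `missing_from_manifest(["./dom/apps/tests/unit/test_webappsActor.js"], manifest_paths("."))` from the build directory would return that path if the checked-out revision really does not contain it, which is the inconsistency described here.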

I think we should disable hg purge until we can track this bug down.
The slave is disabled in slavealloc for investigation.
Unlike bug 873067, I don't see anything obviously weird in the logs (e.g. a missing parent changeset).

Possibilities:

1) Mercurial bug (we're running 2.5.4, right?)
2) Performing the purge before |hg up| results in oddities (but I don't think it should matter)
3) Filesystem or other weirdness.

I'd hate to disable hg purge because it results in such a nice perf win. But, if it's buggy, that doesn't leave us much choice.
This machine hasn't gotten the hg update yet, it's still running 2.0.2
(In reply to Chris AtLee [:catlee] from comment #5)
> This machine hasn't gotten the hg update yet, it's still running 2.0.2

In that case I'm inclined to blame an old, buggy hg version.
Alright. I'll clobber the machine and throw it back in the pool. I don't know what the status of hg upgrades on OSX builders is.
(In reply to Chris AtLee [:catlee] from comment #7)
> I don't know what the status of hg upgrades on OSX builders is.

Around the corner, Bug 868192

I'm currently leaving the final deploy to our OSX builders up to the puppet320 deploy for OSX.

Bug 760093 tracks the puppet320 part.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
I just saw this on the try build at https://tbpl.mozilla.org/?tree=Try&rev=1844e440cadf

Log at https://tbpl.mozilla.org/php/getParsedLog.php?id=24389790&tree=Try

Slave details:

Linux try build on 2013-06-20 10:45:25 PDT for push 1844e440cadf
slave: bld-centos6-hp-041
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
[cltbld@bld-centos6-hp-041.build.scl1.mozilla.com ~]$ hg --version
Mercurial Distributed SCM (version 2.5.4)
(In reply to Chris AtLee [:catlee] from comment #10)
> [cltbld@bld-centos6-hp-041.build.scl1.mozilla.com ~]$ hg --version
> Mercurial Distributed SCM (version 2.5.4)

This is starting to look like a Mercurial bug :(
From a random try push, https://tbpl.mozilla.org/php/getParsedLog.php?id=24370008&tree=Try

Can we please stop using hg purge, stop giving vastly more utterly bogus try results than we know we are giving, and *then* decide where the bug lies and what we should do about it?
(In reply to Phil Ringnalda (:philor) from comment #12)
> From a random try push,
> https://tbpl.mozilla.org/php/getParsedLog.php?id=24370008&tree=Try
> 
> Can we please stop using hg purge, stop giving vastly more utterly bogus try
> results than we know we are giving, and *then* decide where the bug lies and
> what we should do about it?

Agreed.
Attachment #765932 - Flags: review?(rail)
I was going to suggest printing the output of |hg status -A| after |hg up| so we can identify the next culprit. But I concede we'd be in an undefined state and backing out is probably best.

This is such a weird bug.
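
For reference, the `hg status -A` idea could be turned into an automated check along these lines. It is a rough sketch rather than what the automation actually ran: the function names are hypothetical, and it assumes the usual `X path` line format of `hg status -A` and one path per line from `hg manifest -r .`. It flags files that status treats as tracked (clean, modified, or missing) even though the checked-out manifest does not list them.

```python
def parse_status(status_output):
    """Map each path in `hg status -A` output to its one-letter status code.

    Lines that start with a space (copy sources printed under a copied file)
    are skipped.
    """
    states = {}
    for line in status_output.splitlines():
        if len(line) > 2 and line[0] != " " and line[1] == " ":
            states[line[2:]] = line[0]
    return states


def tracked_but_not_in_manifest(status_output, manifest_output):
    """Return paths reported as C, M or ! that the manifest does not contain.

    Added files (A) are legitimately absent from the parent manifest, and
    unknown (?) or ignored (I) files are what purge is meant to delete, so
    neither is flagged here.
    """
    manifest = set(manifest_output.splitlines())
    states = parse_status(status_output)
    return sorted(
        path for path, code in states.items()
        if code in ("C", "M", "!") and path not in manifest
    )
```

An empty result means status and the manifest agree. Any path returned is in the same state as the test file in this bug, and purge would leave it on disk.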
Yeah, I'd certainly like to be able to reproduce this.

What role does .hg/dirstate play? Is there anything we can look at in there to see why hg thinks the file belongs?

Also, if anybody catches this in action again, poke me or buildduty on irc and we can set aside the machine for debugging.
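
On the dirstate question: `.hg/dirstate` records which files hg believes are tracked in the working directory, along with a state letter, mode, size and mtime for each. A stale entry there would explain `hg status` reporting the file as clean. The sketch below reads the file directly. It assumes the version-1 on-disk layout used by Mercurial 2.x (two 20-byte parent node IDs, then entries packed as `>cllll` followed by the file name), and `parse_dirstate` is a hypothetical helper name, not part of Mercurial.

```python
import struct
from binascii import hexlify

# Assumed v1 entry header: state char, mode, size, mtime, file name length.
ENTRY_HEADER = struct.Struct(">cllll")


def parse_dirstate(data):
    """Return (parent1_hex, parent2_hex, entries) from raw dirstate bytes.

    `entries` maps file name -> (state, mode, size, mtime), where state is
    'n' (normal), 'a' (added), 'r' (removed) or 'm' (merged).
    """
    p1, p2 = data[:20], data[20:40]
    entries = {}
    pos = 40
    while pos < len(data):
        state, mode, size, mtime, name_len = ENTRY_HEADER.unpack_from(data, pos)
        pos += ENTRY_HEADER.size
        name = data[pos:pos + name_len]
        pos += name_len
        # A copied file is stored as "dest\0source"; keep only the destination.
        name = name.split(b"\0", 1)[0]
        entries[name.decode("utf-8", "replace")] = (
            state.decode("ascii"), mode, size, mtime,
        )
    return hexlify(p1).decode("ascii"), hexlify(p2).decode("ascii"), entries
```

Reading `.hg/dirstate` in binary mode and passing the bytes to `parse_dirstate` would show whether the stale file has an `n` entry and which parent changeset the working directory claims to be at.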
Attachment #765932 - Flags: review?(rail) → review+
Attachment #765932 - Flags: checked-in+
In production
Product: mozilla.org → Release Engineering
Found in triage, and moving to "Buildduty" because of comment#16. Tweaked summary, and bug dependencies, based on comments so far.

gps: Looks like no further occurrences have been reported since "hg purge" was disabled in comment#17. Any word from hg folks - is this a known issue?

(In reply to Chris AtLee [:catlee] from comment #16)
> Yeah, I'd certainly like to be able to reproduce this.
> 
> What role does .hg/dirstate play? Is there anything we can look at in
> there to see why hg thinks the file belongs?
> 
> Also, if anybody catches this in action again, poke me or buildduty on irc
> and we can set aside the machine for debugging.

emorley, ryanvm, tomcat: if you see this again, please ping buildduty so we can pull the machine from production for investigation.
Component: Other → Buildduty
Depends on: 868192
Flags: needinfo?(ryanvm)
Flags: needinfo?(gps)
Flags: needinfo?(emorley)
Flags: needinfo?(cbook)
Summary: failure to clobber on try? → "hg purge" causing intermittent problems on linux+osx
Sure :-)
Flags: needinfo?(emorley)
We don't know what the underlying issue was. But considering not using purge with Mercurial or the Git equivalent is costing tons of time in builders (5-10% of total build job time for some builders), I highly encourage the appropriate people to investigate the causes of this. IMO we should start upgrading Mercurial clients to 2.6.3 or 2.7.0 at the earliest convenience (the client version doesn't need to match the server version) - this is something we should be doing anyway, regardless of this bug. We should also investigate selectively enabling purging on platforms that are known to not have problems - it's possible something wonky on some builders/platforms is confusing things.

This bug should likely also be resolved, possibly duped on bug 851270.
Flags: needinfo?(gps)
Flags: needinfo?(ryanvm)
It sounds like we suspect that older versions of hg have a broken purge in some situations. Is that right? It certainly reads to me like this is about debugging our usage of "hg purge" and then turning it back on, so I'm moving it out of buildduty...
Component: Buildduty → General Automation
Flags: needinfo?(cbook)
Blocks: 851270
The purge code was backed out ages ago, and so we haven't had subsequent issues. We think we've tracked down the cause of the issue in bug 969689, so let's move future discussion there.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → WORKSFORME
Component: General Automation → General