Closed Bug 695467 (opened 13 years ago, closed 13 years ago)

Bm builders are hitting mercurial bugs & failing (turning blue)

Categories

(Release Engineering :: General, defect, P3)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 693202

People

(Reporter: dholbert, Assigned: bkero)

Details

(Whiteboard: [buildduty][hg])

We seem to be hitting a mercurial bug on Mozilla-Beta and Mozilla-Aurora, triggering tons of (automatically-retriggered) Bm builds.

e.g.
https://tbpl.mozilla.org/php/getParsedLog.php?id=6914607&tree=Mozilla-Beta
OS X 10.5.2 Mobile Desktop mozilla-beta build on 2011-10-18 12:19:07 PDT for push 72be1d924c35

https://tbpl.mozilla.org/php/getParsedLog.php?id=6913654&tree=Mozilla-Beta
WINNT 5.2 Mobile Desktop mozilla-beta build on 2011-10-18 11:36:38 PDT for push 52c9c801be77

The error looks like this:
{
 argv: ['/usr/local/bin/hg', 'clone', '--verbose', '--noupdate', u'http://hg.mozilla.org/releases/mozilla-beta', 'build']
 environment:
  Apple_PubSub_Socket_Render=/tmp/launch-O1SyV8/Render
  CVS_RSH=ssh
  DISPLAY=/tmp/launch-wS10Xl/:0
  HOME=/Users/cltbld
  LOGNAME=cltbld
  PATH=/tools/buildbot/bin:/tools/python/bin:/opt/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/builds/slave/m-beta-osx-mb
  SHELL=/bin/bash
  SSH_AUTH_SOCK=/tmp/launch-bqv65a/Listeners
  TMPDIR=/var/folders/TL/TLg3RrMbFAur2hBCXvCeqk+++TM/-Tmp-/
  USER=cltbld
  __CF_USER_TEXT_ENCODING=0x1F6:0:0
 using PTY: False
transaction abort!
requesting all changes
adding changesets
rollback completed
** unknown exception encountered, please report by visiting
**  http://mercurial.selenic.com/wiki/BugTracker
** Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17) [GCC 4.0.1 (Apple Inc. build 5465)]
** Mercurial Distributed SCM (version 1.7.5)
** Extensions loaded: share, rebase, mq, purge
Traceback (most recent call last):
  File "/usr/local/bin/hg", line 38, in <module>
    mercurial.dispatch.run()
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/dispatch.py", line 16, in run
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/dispatch.py", line 36, in dispatch
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/dispatch.py", line 58, in _runcatch
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/dispatch.py", line 593, in _dispatch
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/dispatch.py", line 401, in runcommand
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/dispatch.py", line 644, in _runcommand
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/dispatch.py", line 598, in checkargs
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/dispatch.py", line 591, in <lambda>
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/util.py", line 426, in check
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/commands.py", line 736, in clone
    
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/hg.py", line 337, in clone
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/localrepo.py", line 1886, in clone
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/localrepo.py", line 1295, in pull
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/localrepo.py", line 1692, in addchangegroup
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/revlog.py", line 1381, in addgroup
  File "tools/mercurial-1.7.5/lib/python2.5/site-packages/mercurial/revlog.py", line 1220, in _addrevision
mpatch.mpatchError: patch cannot be decoded
elapsedTime=1044.706320
program finished with exit code 1
}
cc-ing bkero because he upgraded varnish in bug 693202 this morning
Priority: -- → P3
Whiteboard: [buildduty][hg]
The problem appears to have started today.

The last-good push to Mozilla-Beta was last Thursday:
   https://tbpl.mozilla.org/?tree=Mozilla-Beta&rev=522217082f0d
The first-bad push was this morning at 11:00 AM:
   https://tbpl.mozilla.org/?tree=Mozilla-Beta&rev=df9841857c9c
(nothing else was pushed between Thursday and today)

On Mozilla-Aurora, the last-good push appears to be this morning at 11:11 AM:
   https://tbpl.mozilla.org/?tree=Mozilla-Aurora&rev=4754469691db
and the first-bad push was today at 12:16 PM:
   https://tbpl.mozilla.org/?tree=Mozilla-Aurora&rev=54a04805efe1
bkero: are there scripts involved here that need to be updated, like those for comm-beta this morning?
coop: I'm wondering if the scripts I updated were the same as the comm-beta ones.  Is it possible to rerun this job to see if this problem was fixed with the scripts that I updated for comm-beta?
So, these mobile desktop builders (Bm) do a regular 'hg clone .../releases/mozilla-beta', instead of using our hgtool.py which uses hg share where it can. Consequently they will cause more traffic than the other builds (B), and the constant retrying could lead to a situation where you never get out of a broken state. The slaves having issues are located in SJC1, so the traffic is intra-colo.
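
To make the traffic difference concrete, here is a minimal sketch of the two checkout strategies as hg command lines. It is not hgtool.py's actual logic: the function name, the `share_base` directory, and the store naming scheme are assumptions made for illustration. It only builds argument lists and does not run hg.

```python
import os


def checkout_commands(repo_url, dest, share_base=None):
    """Return the hg command lines (as argv lists) needed to populate dest.

    share_base is a hypothetical directory holding one shared store per
    repository; None means "plain clone", as the Bm builders do.
    """
    if share_base is None:
        # Full clone: every build pulls the whole history over HTTP.
        return [["hg", "clone", "--noupdate", repo_url, dest]]

    # Derive a store path from the repo URL (naming scheme is an assumption).
    store = os.path.join(share_base, repo_url.rstrip("/").split("/")[-1])
    commands = []
    if os.path.isdir(os.path.join(store, ".hg")):
        # Store already exists: only new changesets cross the network.
        commands.append(["hg", "pull", "-R", store, repo_url])
    else:
        # First use on this slave: one full clone into the shared store.
        commands.append(["hg", "clone", "--noupdate", repo_url, store])
    # `hg share` (share extension) creates a repo backed by the store.
    commands.append(["hg", "share", "--noupdate", store, dest])
    return commands
```

With a plain clone, each retry re-downloads the full history through the cache. With a shared store, a retry only costs an incremental pull.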

We know we changed varnish this morning, and we're consistently getting 
  mpatch.mpatchError: patch cannot be decoded

Can we try dumping all the pages for mozilla-beta and mozilla-aurora in the varnish cache?
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Assignee: server-ops-releng → bkero
This is happening on the main mozilla-central and mozilla-inbound trees too.
Removing specific mention of Mozilla-Beta from summary.
Summary: Bm builders on Mozilla-Beta are hitting mercurial bugs & failing (turning blue) → Bm builders are hitting mercurial bugs & failing (turning blue)
I've dumped all of the varnish cache to see if that helps resolve this problem.

I have been attempting to replicate the issue.  I've done a duplicate clone on a separate varnish instance (on an unrelated VM) and did not observe the issue.

At this point I think the check might be related to how the cache is expired.  I'll be investigating that.
Component: Server Operations: RelEng → Release Engineering
Ok, thanks. It would be good to know that the cache eviction on a push is working with the new version of varnish.
This is still happening.  See https://tbpl.mozilla.org/?noignore=1&tree=Mozilla-Beta&rev=6120192ea12e

The first build that failed on this branch showed
/usr/local/bin/hg clone --verbose --noupdate http://hg.mozilla.org/releases/mozilla-beta build
<snip>
requesting all changes
abort: HTTP Error 503: Service Unavailable
elapsedTime=930.644596
program finished with exit code 255
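
A 503 like the one above is the kind of failure where immediate automatic retriggers keep hitting a server that is already struggling. As a rough sketch, assuming the retry logic could be wrapped around the clone step, a bounded exponential backoff might look like this. The function name and all numeric defaults are hypothetical, not values from the build system.

```python
import time


def run_with_backoff(action, max_attempts=5, base_delay=30.0, max_delay=600.0,
                     sleep=time.sleep):
    """Call action() until it returns True or max_attempts is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between
    attempts, capped at max_delay. Returns the number of attempts made,
    or raises RuntimeError once the attempts are used up.
    """
    for attempt in range(1, max_attempts + 1):
        if action():
            return attempt
        if attempt < max_attempts:
            # Double the wait after each failure, up to the cap.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise RuntimeError("giving up after %d attempts" % max_attempts)
```

The `sleep` parameter is injectable so the schedule can be checked without actually waiting.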

The following builds show the traceback in comment 0.
https://tbpl.mozilla.org/php/getParsedLog.php?id=6925363&tree=Mozilla-Aurora

(minus some of the noise)
/usr/local/bin/hg clone --verbose --noupdate http://hg.mozilla.org/releases/mozilla-aurora build
requesting all changes
adding changesets
adding manifests
adding file changes
added 58555 changesets with 0 changes to 0 files (+27 heads)
elapsedTime=208.757507

/usr/local/bin/hg identify --num --branch
-1 default

/usr/local/bin/hg update --clean --repository build --rev 0cb1870e32d2d63b380f48bd30e1f8e281dbd5ec
abort: unknown revision '0cb1870e32d2d63b380f48bd30e1f8e281dbd5ec'!
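
The clone in this log exits cleanly even though it reports "0 changes to 0 files", and the failure only surfaces later at `hg update`. A minimal sketch of a sanity check on the clone summary line follows. The function name is hypothetical, and the check is a heuristic for this symptom only; `hg verify` remains the authoritative integrity check.

```python
import re

# Matches Mercurial's summary line, e.g.
# "added 58555 changesets with 0 changes to 0 files (+27 heads)"
SUMMARY_RE = re.compile(
    r"added (\d+) changesets with (\d+) changes to (\d+) files")


def clone_looks_complete(clone_output):
    """Return True if the clone summary reports changesets and file changes.

    A clone that adds changesets but zero file changes, as in the log
    above, is treated as suspect so the job can fail before `hg update`.
    """
    match = SUMMARY_RE.search(clone_output)
    if match is None:
        return False  # no summary line at all: the clone did not finish
    changesets, changes, files = (int(g) for g in match.groups())
    return changesets > 0 and changes > 0 and files > 0
```
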
... and that was the 49th attempt at building for that push :)
(In reply to Phil Ringnalda (:philor) from comment #12)
> ... and that was the 49th attempt at building for that push :)

The darwin9 clones are succeeding now after downgrading varnish. This build should finally persevere.
I'm going to dup this, since it's clearly related to / a symptom of changes in bug 693202.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → DUPLICATE
Product: mozilla.org → Release Engineering