Closed Bug 808536 Opened 7 years ago Closed 7 years ago

ScriptFactory hg steps should RETRY on "abort: HTTP Error 500: Internal Server Error"

Categories

(Release Engineering :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: aki)

References

(Blocks 1 open bug)

Details

(Whiteboard: [mozharness][sheriff-want])

Attachments

(1 file, 2 obsolete files)

eg:

Rev3 Fedora 12 mozilla-inbound debug test marionette on 2012-11-05 04:24:03 PST for push 31784b0d6334

slave: talos-r3-fed-075

https://tbpl.mozilla.org/php/getParsedLog.php?id=16750774&tree=Mozilla-Inbound

{
========= Started 'hg clone ...' failed (results: 2, elapsed: 1 secs) (at 2012-11-05 04:24:04.634905) =========
hg clone http://hg.mozilla.org/build/mozharness scripts
 in dir /home/cltbld/talos-slave/test/. (timeout 1200 secs)
 watching logfiles {}
 argv: ['hg', 'clone', 'http://hg.mozilla.org/build/mozharness', 'scripts']
 environment:
  CVS_RSH=ssh
  DISPLAY=:0.0
  G_BROKEN_FILENAMES=1
  HISTCONTROL=ignoreboth
  HISTSIZE=1000
  HOME=/home/cltbld
  HOSTNAME=talos-r3-fed-075.build.mozilla.org
  LANG=en_US.UTF-8
  LESSOPEN=|/usr/bin/lesspipe.sh %s
  LOGNAME=cltbld
  MAIL=/var/spool/mail/cltbld
  PATH=/home/cltbld/bin:/tools/buildbot-0.8.0/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
  PWD=/home/cltbld/talos-slave/test
  SHELL=/bin/bash
  SHLVL=1
  SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
  TERM=xterm
  USER=cltbld
  _=/home/cltbld/bin/python
 using PTY: False
abort: HTTP Error 500: Internal Server Error
program finished with exit code 255
elapsedTime=1.282653
========= Finished 'hg clone ...' failed (results: 2, elapsed: 1 secs) (at 2012-11-05 04:24:05.955716) =========
}
Found in triage.
Component: Release Engineering → Release Engineering: Automation (General)
QA Contact: catlee
Whiteboard: [mozharness] → [mozharness][sheriff-want]
Depends on: 793642
No longer depends on: 793642
Blocks: 770960, 793022
Summary: hg clone http://hg.mozilla.org/build/mozharness should RETRY on "abort: HTTP Error 500: Internal Server Error" → ScriptFactory hg steps should RETRY on "abort: HTTP Error 500: Internal Server Error"
Not as simple as s,ShellCommand,RetryingShellCommand, :

[11:27]	<catlee>	we don't always have retry.py
[11:27]	<catlee>	esp. if the script repo isn't tools
Can we at least make buildbot RETRY using https://hg.mozilla.org/build/buildbotcustom/file/tip/status/errors.py like we do for other "HTTP error .*"?
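For reference, a rough sketch of what log-based RETRY evaluation could look like. This is not the code in buildbotcustom/status/errors.py; the pattern list, the function name, and the locally defined result codes are all stand-ins made up for illustration.

```python
import re

# Local stand-ins for Buildbot result codes (assumed; not imported from Buildbot).
SUCCESS = 0
FAILURE = 2
RETRY = 5

# Hypothetical pattern list; the real one lives in buildbotcustom/status/errors.py.
HG_RETRY_PATTERNS = [
    re.compile(r"abort: HTTP Error 5\d\d"),
]


def evaluate_hg_log(exit_code, log_text):
    """Map a finished hg step to a result, preferring RETRY for transient errors."""
    if exit_code == 0:
        return SUCCESS
    for pattern in HG_RETRY_PATTERNS:
        if pattern.search(log_text):
            return RETRY
    return FAILURE


log = (
    "abort: HTTP Error 500: Internal Server Error\n"
    "program finished with exit code 255\n"
)
print(evaluate_hg_log(255, log))
```

With something like this attached to the step, the 500 from hg.mozilla.org would reschedule the job instead of turning it red.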
Q: [11:11]	<aki>	is there a reason we don't use the Mercurial step in ScriptFactory?
A: Because we pass 'branch' in as a property to ScriptFactory, and the Mercurial step appends this to the repo path, even if you actually want to check out build/tools or build/mozharness and would rather the Mercurial step ignore the 'branch' property.

If we want to use the Mercurial step, we need to stop passing in 'branch' and perhaps pass in 'branch_name' or something.  Which means anything using ScriptFactory needs to stop relying on 'branch' being named 'branch' and instead look for 'branch_name'.

I imagine that won't be a small change; looking to see how large+ugly it might actually be.
( /usr/local/bin/hg clone --verbose --noupdate http://hg.mozilla.org/build/toolsmozilla-central scripts )

Also, I don't know if this will work on the test slaves, many of which do not have hg in their PATH.  We might be able to get around that via env manipulation.
(it is retrying the build/toolsmozilla-central clone, however.)
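The bogus URL in that command is what you would get if the step simply concatenates the repo URL and the 'branch' property. A minimal sketch of that assumed behaviour (clone_url is a made-up helper, not the Mercurial step's actual code):

```python
def clone_url(base_url, branch):
    # Assumed behaviour: the branch property is appended to the base URL
    # with no separator and no check that the caller wanted it.
    return base_url + branch


# 'branch' is set for the benefit of the script, not the clone.
print(clone_url("http://hg.mozilla.org/build/tools", "mozilla-central"))

# With no branch property the URL comes out as intended.
print(clone_url("http://hg.mozilla.org/build/tools", ""))
```

This is why the plan involves renaming the property: as long as 'branch' is set, the clone target is wrong.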
1. This is the current plan for renaming branch -> branch_name and using the Mercurial step:

X write test patch to try Mercurial step
X try Mercurial step
_ write test patch to rename branch to branch_name
_ test patch to rename branch to branch_name -- if yes, proceed, if not, ditch Mercurial step and replan
  _ large scale testing for anything that uses ScriptFactory
_ write buildbot-configs patch to add branch_name to properties
_ write (script-side) patch(es) to use branch_name instead of branch
_ write patch to remove branch from properties
_ clean up patch to use Mercurial step
_ possibly more testing
_ roll out branch_name property
_ (wait for reconfig)
_ roll out branch_name scripts
_ remove branch from properties
_ (wait for reconfig)
_ roll out Mercurial step
_ (wait for reconfig)
_ deal with any fallout not found in testing

2. Alternative that isn't as beneficial, but is faster, simpler, and less likely to bork everything:

_ Write a ScriptFactoryMercurialCloneCommand a la http://hg.mozilla.org/build/buildbotcustom/file/b03160f50ca5/steps/source.py#l7 , except instead of wrapping with retry.py, just set buildbot RETRY status and bail out.
_ test
_ roll out
_ (wait for reconfig)

Trying approach #2.
And I don't even have to do that:

http://hg.mozilla.org/build/buildbotcustom/file/b03160f50ca5/process/factory.py#l462
uses MercurialCloneCommand for cloning build/tools, even though retry.py is only available once build/tools has been cloned.  However, it sets retry=False, which means we skip the retry.py wrapper and fall back to only using the log_eval_func of hg_errors http://hg.mozilla.org/build/buildbotcustom/file/b03160f50ca5/status/errors.py#l12 , which already sets RETRY.

_ change ShellCommand to MercurialCloneCommand(retry=False) for the two hg steps in ScriptFactory
_ test
_ roll out
_ (wait for reconfig)
Sending r? to :bhearsum since he added http://hg.mozilla.org/build/buildbotcustom/annotate/b03160f50ca5/process/factory.py#l462 for bug 613953.
Assignee: nobody → aki
Attachment #681619 - Attachment is obsolete: true
Attachment #681747 - Flags: review?(bhearsum)
I was able to test this by
a) setting mozharness_repo_path to something invalid: "users/asasaki_mozilla.com/nonexistent"
b) adding re.compile('404') to the list of RETRY errors in buildbotcustom/status/errors.py
c) kicking off an android l10n nightly and watching it retry over and over until I removed the 404 from the RETRY list and reconfiged, at which point it went red.
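Roughly, the effect of toggling that pattern can be sketched as below. The classify helper, the base pattern list, the sample log line, and the result codes are assumptions for illustration, not the contents of errors.py.

```python
import re

# Local stand-ins for Buildbot result codes (assumed).
FAILURE, RETRY = 2, 5


def classify(log_text, retry_patterns):
    """Return RETRY if any pattern matches the failed step's log, else FAILURE."""
    if any(p.search(log_text) for p in retry_patterns):
        return RETRY
    return FAILURE


# Hypothetical baseline list, plus the temporary 404 entry used for testing.
base_patterns = [re.compile(r"HTTP Error 5\d\d")]
test_patterns = base_patterns + [re.compile("404")]

# Assumed log line for a clone of a nonexistent repo.
log_404 = "abort: HTTP Error 404: Not Found"

print(classify(log_404, test_patterns))  # retries while the 404 pattern is present
print(classify(log_404, base_patterns))  # goes red once the pattern is removed
```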
Blocks: 812149
Comment on attachment 681747 [details] [diff] [review]
use MercurialCloneCommand(retry=False)

Review of attachment 681747 [details] [diff] [review]:
-----------------------------------------------------------------

::: process/factory.py
@@ +6225,5 @@
>              workdir=".",
> +            haltOnFailure=True,
> +            retry=False,
> +        ))
> +        self.addStep(MercurialCloneCommand(

What's the reasoning behind using MercurialCloneCommand instead of ShellCommand? There's no failure mode here that's worth retrying AFAIK.
(In reply to Ben Hearsum [:bhearsum] from comment #21)
> Comment on attachment 681747 [details] [diff] [review]
> use MercurialCloneCommand(retry=False)
> 
> Review of attachment 681747 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: process/factory.py
> @@ +6225,5 @@
> >              workdir=".",
> > +            haltOnFailure=True,
> > +            retry=False,
> > +        ))
> > +        self.addStep(MercurialCloneCommand(
> 
> What's the reasoning behind using MercurialCloneCommand instead of
> ShellCommand? There's no failure mode here that's worth retrying AFAIK.

for the update?
I can change that back easily if you want.
Attachment #681747 - Attachment is obsolete: true
Attachment #681747 - Flags: review?(bhearsum)
Attachment #682068 - Flags: review?(bhearsum)
Attachment #682068 - Flags: review?(bhearsum) → review+
Comment on attachment 682068 [details] [diff] [review]
only MercurialCloneCommand the clone

Thanks Ben!

http://hg.mozilla.org/build/buildbotcustom/rev/97dd45bbc94a

pending reconfig
Attachment #682068 - Flags: checked-in+
In production.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General