Closed Bug 613953 Opened 9 years ago Closed 9 years ago

Build / repack uploads should be retried if they fail

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: bhearsum)

Details

(Whiteboard: [automation])

Attachments

(2 files, 3 obsolete files)

Regular and release builds and repacks often hit problems uploading to stage (bug 610399).

Instead of dying, we should re-try the upload a few times.
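A minimal sketch of the kind of retry loop being asked for, in Python. This is not the actual retry.py from the tools repository; the function name, parameters, and defaults are hypothetical and chosen only for illustration. It assumes the upload can be wrapped in a callable that returns an exit code, with 0 meaning success.

```python
import time


def retry(action, attempts=5, sleeptime=1.0, sleep=time.sleep):
    """Call action() until it returns 0 or the attempts run out.

    Hypothetical helper, not the real retry.py. Returns the last
    return code so the caller can treat 0 as success.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    rc = None
    for attempt in range(1, attempts + 1):
        rc = action()
        if rc == 0:
            return 0
        # Only pause when another attempt is still to come.
        if attempt < attempts:
            sleep(sleeptime)
    return rc
```

Returning the last non-zero code, rather than raising, keeps the exit status meaningful to whatever invoked the wrapper.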
Blocks: 478420
OS: Linux → All
Priority: -- → P3
Hardware: x86_64 → All
Planning to fix this during this quarter.
Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
The repack part of this was fixed in bug 613970.
Depends on: 613970
This patch adds retry.py to a bunch of places, including (most) uploads. Still doing a bunch of testing on it.
Attached patch: refined version (obsolete)
Due to the upload_errors handling I added in bug 661401, this got a bit more complicated. In cases where we have to retry at least once but succeed in the end, the error messages from the failed attempts are still caught by the log_eval_func, and the overall status would get set to RETRY. To work around that, all of the Retrying* steps now always succeed if the return code is 0, putting the onus on retry.py to exit correctly.

I tested this by using 'iptables -A OUTPUT -d 10.2.71.82 -j REJECT' on mv-moz2-linux-ix-slave01 to cause quick Connection Refused messages. I tested the succeeds-on-first-attempt, succeeds-on-subsequent-attempt, and never-succeeds scenarios. The first two resulted in SUCCESS, the latter in RETRY.
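The status rule described above can be sketched roughly as follows. This is an illustration only, not the buildbotcustom code: the function name, the string stand-ins for the status constants, and the error marker are all hypothetical. The FAILURE branch for non-zero exits without a recognised upload error is also an assumption; the comment above only covers the SUCCESS and RETRY cases.

```python
# String stand-ins for the build status constants (hypothetical).
SUCCESS, FAILURE, RETRY = "SUCCESS", "FAILURE", "RETRY"

# Hypothetical marker; the real upload_errors patterns are not shown in this bug.
UPLOAD_ERROR_MARKERS = ("connection refused",)


def evaluate_retrying_step(return_code, log_text):
    """Decide the step status for a command wrapped in a retry helper."""
    # A zero exit code wins, even if earlier failed attempts left
    # error messages in the log.
    if return_code == 0:
        return SUCCESS
    lowered = log_text.lower()
    if any(marker in lowered for marker in UPLOAD_ERROR_MARKERS):
        return RETRY
    # Assumption: anything else is an ordinary failure.
    return FAILURE
```

The three tested scenarios map onto this as: first-attempt success and eventual success both exit 0 and give SUCCESS, while never succeeding exits non-zero with connection errors in the log and gives RETRY.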
Attachment #537775 - Attachment is obsolete: true
Attachment #538049 - Flags: review?(catlee)
Attachment #538049 - Flags: review?(catlee) → review+
Comment on attachment 538049
refined version

Landed on the default branch of buildbotcustom.
Attachment #538049 - Flags: checked-in+
This made it to production.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Backed out due to the POSIX path to retry.py borking 1.9.2 win32.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I can't find any errors like the one you describe on http://tbpl.mozilla.org/?tree=Firefox3.6, can you point me at the specific issue?
Attachment #538049 - Flags: checked-in+ → checked-in-
Found it: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1307745554.1307746489.759.gz&fulltext=1

python /e/builds/moz2_slave/192-w32-unittest/tools/buildfarm/utils/retry.py -s 1 -r 5 python 192-w32-unittest/tools/clobberer/clobberer.py -s tools -t 168 http://build.mozilla.org/clobberer/index.php mozilla-1.9.2 WINNT 5.2 mozilla-1.9.2 unit test 192-w32-unittest mw32-ix-slave14 http://buildbot-master08.build.scj1.mozilla.com:8001/
 in dir e:\builds\moz2_slave\192-w32-unittest\.. (timeout 3600 secs)
 watching logfiles {}
 argv: ['python', '/e/builds/moz2_slave/192-w32-unittest/tools/buildfarm/utils/retry.py', '-s', '1', '-r', '5', 'python', '192-w32-unittest/tools/clobberer/clobberer.py', '-s', 'tools', '-t', '168', 'http://build.mozilla.org/clobberer/index.php', 'mozilla-1.9.2', 'WINNT 5.2 mozilla-1.9.2 unit test', '192-w32-unittest', 'mw32-ix-slave14', u'http://buildbot-master08.build.scj1.mozilla.com:8001/']

python: can't open file '/e/builds/moz2_slave/192-w32-unittest/tools/buildfarm/utils/retry.py': [Errno 2] No such file or directory
program finished with exit code 2
elapsedTime=0.110000
Same as before, except I've added the "pwd -W" toolsdir fix in UnittestBuildFactory, because it uses MercurialCloneCommand (which uses retry.py), and I've turned MozillaClobberer back into a ShellCommand, because toolsdir isn't set properly on Windows in MozillaBuildFactory. I ran some 1.9.2 and m-c builds in staging, including unittests - they worked fine.
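To illustrate the path problem from the log above: native Windows Python cannot open an MSYS-style path such as /e/builds/..., whereas "pwd -W" in an MSYS shell prints the Windows form (e:/builds/...). The sketch below shows the same mapping in Python. The helper name is hypothetical, and it is not what the patch does (the patch relies on "pwd -W" itself); it only makes sense for paths known to come from MSYS.

```python
import re


def msys_to_windows(path):
    """Map an MSYS drive path like /e/builds to e:/builds.

    Hypothetical helper for illustration. Paths that do not look
    like /<drive letter>/... are returned unchanged.
    """
    match = re.match(r"^/([A-Za-z])(/.*)?$", path)
    if not match:
        return path
    drive = match.group(1)
    rest = match.group(2) or "/"
    return "%s:%s" % (drive, rest)
```

Applied to the failing path in the log, this turns /e/builds/moz2_slave/192-w32-unittest/tools/buildfarm/utils/retry.py into e:/builds/moz2_slave/192-w32-unittest/tools/buildfarm/utils/retry.py.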
Attachment #538926 - Flags: review?(catlee)
Attachment #538926 - Flags: review?(catlee) → review+
Attachment #538049 - Attachment is obsolete: true
Comment on attachment 538926
full patch, with fix for UnittestBuildFactory/MozillaClobber

This is on the default branch again, heading to production later today.
Attachment #538926 - Flags: checked-in+
Haven't seen any additional fallout.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
This is causing errors with 'make upload' in scratchbox commands.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch: fix scratchboxcommand fallout (obsolete)
This patch should fix the root problem. I haven't had a chance to run test_masters yet, but I have tested that the list comprehension is valid:

>>> [str(x) for x in [1,2,3,'john']]
['1', '2', '3', 'john']
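For context, here is one plausible way a non-string argument can break a command. This is an assumption; the bug does not show the actual traceback or the ScratchboxCommand code. If a command list containing an integer timeout is joined into a single shell string, the join raises TypeError, and converting every element with str() first avoids that. The command contents and the helper name below are hypothetical.

```python
def stringify_command(command):
    """Return a copy of command with every argument converted to str."""
    return [str(arg) for arg in command]


# Hypothetical command list with an integer timeout mixed in.
command = ["some-wrapper", "--timeout", 3600, "make", "upload"]

try:
    " ".join(command)
except TypeError as exc:
    # str.join only accepts strings, so the int is rejected.
    print("join failed:", exc)

print(" ".join(stringify_command(command)))
```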
Attachment #539246 - Flags: review?(bhearsum)
Comment on attachment 539246
fix scratchboxcommand fallout

I don't want to take this as a bustage fix. I'm attaching a safer fix.
Comment on attachment 539248
work around limitations by forcing timeout to be a string

r+ for bustage
Attachment #539248 - Flags: review?(jhford) → review+
Comment on attachment 539246
fix scratchboxcommand fallout

We didn't end up using this patch.
Attachment #539246 - Attachment is obsolete: true
Attachment #539246 - Flags: review?(bhearsum)
Comment on attachment 539248
work around limitations by forcing timeout to be a string

This was landed a few days ago.
Attachment #539248 - Flags: checked-in+
This is all done, again. See bug 664211.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering