Closed Bug 1025842 Opened 6 years ago Closed 6 years ago

mock 'archives' are fragile on spot instances

Categories

(Release Engineering :: General, defect, blocker)

x86
Linux
defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: nthomas, Unassigned)

Bug 1024962 was try spot in us-west-2. Today we have bld spot in us-west-2 showing similar symptoms, e.g. bld-linux64-spot-443, 459, 477, 320, 473, 466 all failing to unpack the root archive mock caches on what look like newly launched spot instances. The failure is quick, and the next build is successful.

I had assumed it was bug 1023477 but it seems we have backed out the fs changes there.
Seems this is hitting more and more slaves.
Severity: normal → critical
and closed inbound now due to a higher failure rate
(In reply to Carsten Book [:Tomcat] from comment #2)
> and closed inbound now due to a higher failure rate

Update 6:20 - all integration trees are closed since this problem is spreading.

Affected builders, for example:
slave: bld-linux64-spot-1002 - https://tbpl.mozilla.org/php/getParsedLog.php?id=41788883&tree=Mozilla-Inbound&full=1

slave: bld-linux64-spot-431 - https://tbpl.mozilla.org/php/getParsedLog.php?id=41788917&tree=Mozilla-Inbound&full=1

slave: bld-linux64-spot-1001 - https://tbpl.mozilla.org/php/getParsedLog.php?id=41789434&tree=Fx-Team&full=1

slave: bld-linux64-spot-075 - https://tbpl.mozilla.org/php/getParsedLog.php?id=41789426&tree=Fx-Team

slave: bld-linux64-spot-175 - https://tbpl.mozilla.org/php/getParsedLog.php?id=41789338&tree=Fx-Team
Severity: critical → blocker
Checking...
I again suspect changes in bug 1023477...
Blocks: 1023477
I'm still seeing failures:
https://tbpl.mozilla.org/php/getParsedLog.php?id=41792604&tree=Mozilla-Inbound

Are these just from old instances?
Yes, I just checked bld-linux64-spot-092; it was based on one of the "bad" AMIs.
Have we purged all the instances based on the bad AMI(s)?
(In reply to Nick Thomas [:nthomas] from comment #8)
> Have we purged all the instances based on the bad AMI(s)?

http://hg.mozilla.org/build/cloud-tools/file/default/scripts/aws_terminate_by_ami_id.py
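
For anyone checking by hand, the selection step behind "terminate by AMI id" can be sketched roughly as below. This is not the contents of aws_terminate_by_ami_id.py; it is a minimal illustration that assumes you already have an inventory of instances with their image ids. The function name, the record fields ("name", "image_id"), the hostnames and the AMI ids are all hypothetical.

```python
from typing import Iterable, List, Mapping


def instances_from_bad_amis(
    instances: Iterable[Mapping[str, str]],
    bad_ami_ids: Iterable[str],
) -> List[str]:
    """Return the sorted names of instances launched from any of the given AMIs."""
    bad = set(bad_ami_ids)
    return sorted(
        inst["name"] for inst in instances if inst.get("image_id") in bad
    )


# Hypothetical inventory: hostnames and AMI ids are invented for illustration.
inventory = [
    {"name": "example-spot-1", "image_id": "ami-bad00001"},
    {"name": "example-spot-2", "image_id": "ami-good0001"},
    {"name": "example-spot-3", "image_id": "ami-bad00001"},
]

# Names of the instances that would need to be terminated and relaunched.
to_terminate = instances_from_bad_amis(inventory, ["ami-bad00001"])
print(to_terminate)
```

The same list also answers the question of whether a given failing hostname is an old instance: if its name is in the result, it was launched from one of the bad AMIs.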
bld-linux64-spot-494 looks to have failed 4 times in a manner similar to the ones from https://bugzilla.mozilla.org/show_bug.cgi?id=1025842#c3

Also, this might be related: bld-linux64-spot-136 had issues with the zip/objcopy step in 'make buildsymbols'.

How can I get the list of 'bad' AMIs so I can verify either of these?

rail - I'm guessing you have run that script against all known bad AMIs, so these spot hostnames might be good now?
I think this may be affecting more than just spot instances.

Single locale android repacks - at least the ones that run on ix machines - are failing because they never run "mock install". E.g.: http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-aurora-l10n/mozilla-aurora-android-l10n_3-unknown-bm71-build1-build1.txt.gz

mock install was never run (presumably because this condition failed: http://mxr.mozilla.org/build-central/source/mozharness/mozharness/mozilla/mock.py#200), and autoconf213 was never installed.
No longer blocks: 1026480
Blocks: 1025801
Seems to be working better now.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Component: General Automation → General