investigate build failures on b-2008 instances

RESOLVED FIXED

Status

Infrastructure & Operations
RelOps
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: markco, Assigned: markco)

(Assignee)

Description

2 years ago
We have 10 completed non-try builds: 5 succeeded and 5 failed. Of the 5 failures, 3 were on revisions that were also busted on other builds. I have not been able to determine the reason for the other 2 failures.

http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%206.1%20x86-64%20mozilla-inbound%20leak%20test%20build/builds/2

Branch
integration/mozilla-inbound
Revision
8d20513ae79ed8bf3366536f318935bdab1ed9f2
Got Revision
8d20513ae79e
Successful on other platforms
Undetermined reasons for failure. 


http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/TB%20WINNT%205.2%20comm-central%20build/builds/0

Got Revision
7a674f9ee355
Busted on all platforms 
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=7a674f9ee355



http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-inbound%20build/builds/2

Branch
integration/mozilla-inbound
Revision
6d7d90a28e057220e59988c6fca3ed5f20bacea3
Got Revision
6d7d90a28e05
All other builds successful 
Undetermined reasons for failure. 

http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/TB%20WINNT%205.2%20comm-central%20build/builds/0 

Branch
comm-central
Revision
7a674f9ee35503b88660df41e4616967b72ad765
Got Revision
7a674f9ee355
Busted on all platforms 
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=7a674f9ee355


http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-inbound%20build/builds/0

Branch
integration/mozilla-inbound
Revision
8d0a776c5d1e7ad6929e6a5e6afeff75b67c3886
Got Revision
8d0a776c5d1e
Busted on XP debug as well 
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8d0a776c5d1e
(Assignee)

Updated

2 years ago
Assignee: relops → mcornmesser
(Assignee)

Comment 1

2 years ago
All the failed builds from the last 24 hours have been related to a known issue with the l10n dep builds. However, there have not been enough non-l10n builds to gauge whether we are safe to significantly expand non-try building in AWS.

I am going to move forward with a 3 machine test pool, b-2008-spot-002 through 004.
(Assignee)

Comment 2

2 years ago
PGO builds are timing out:

http://buildbot-master77.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-inbound%20pgo-build/builds/0

16:00:01     INFO -  mozmake.EXE[3]: Leaving directory 'c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox'
16:00:01     INFO -  set -e; \
16:00:01     INFO -  for mkfile in build/sccache.mk; do \
16:00:01     INFO -    c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/mozmake.EXE -f c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/$mkfile postflight_all TOPSRCDIR=c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src OBJDIR=c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox MOZ_OBJDIR=c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox; \
16:00:01     INFO -  done
16:00:01     INFO -  mozmake.EXE[3]: Entering directory 'c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src'
16:00:01     INFO -  # Terminate sccache server. This prints sccache stats.
16:00:01     INFO -  python2.7 c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/sccache/sccache.py 2>&1 | gzip > c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox/dist/sccache.log.gz

command timed out: 19800 seconds elapsed running ['c:/mozilla-build/python27/python', '-u', 'scripts/scripts/fx_desktop_build.py', '--config', 'builds/releng_base_windows_32_builds.py', '--config', 'balrog/production.py', '--branch', 'mozilla-inbound', '--build-pool', 'production', '--enable-pgo'], attempting to kill
SIGKILL failed to kill process
using fake rc=-1
program finished with exit code -1

remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last):
Failure: exceptions.RuntimeError: SIGKILL failed to kill process
(Assignee)

Comment 3

2 years ago
From an IRC conversation, this may have been caused by insufficient RAM. I am moving the instance type to c3.2xlarge, which will increase the RAM from 7.5 to 15 gigs.

Also, the display may be an issue for the PGO builds.
(Assignee)

Comment 4

2 years ago
In the last 2 days there have still been many l10n builds failing on the get mar step, but several l10n builds on hardware are failing with the same error at the same place.

There was a sendchange failure on this build: http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%206.1%20x86-64%20mozilla-beta%20leak%20test%20build/builds/0
It is suspected this was due to a malformed directory in the PATH. I have updated the PATH to include C:\mozilla-build\buildbotve\scripts and recaptured the base image: b-2008-VAN-BASE-SDK-VS-postPUPPET-GOODBASE-VER9-2015-11-13 (ami-563b403c). There are now 4 instances up based on this image: b-2008-spot-002 and 004 through 006.

The most recent PGO build failed despite changing the instance type to c3.2xlarge:
 http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%206.1%20x86-64%20mozilla-beta%20leak%20test%20build/builds/0
 

Outside of what has been mentioned above, the rest of the builds are mainly coming up green.
(Assignee)

Comment 5

2 years ago
Correction: The most recent PGO build failed despite changing the instance type to c3.2xlarge: http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20fx-team%20pgo-build/builds/0
(Assignee)

Comment 6

2 years ago
Catlee pointed out in the log from comment 5: 

12:08:16     INFO -  Warning: C4146 in c:\builds\moz2_slave\fx-team-w32-pgo-00000000000000\build\src\js\src\jit\x86-shared\AtomicOperations-x86-shared.h: unary minus operator applied to unsigned type, result still unsigned
12:08:16     INFO -  c:\builds\moz2_slave\fx-team-w32-pgo-00000000000000\build\src\js\src\jit/x86-shared/AtomicOperations-x86-shared.h(522) : warning C4146: unary minus operator applied to unsigned type, result still unsigned
12:08:16     INFO -  c:/builds/moz2_slave/fx-team-w32-pgo-00000000000000/build/src/js/src/jsgc.cpp(4801) : fatal error C1088: Cannot flush compiler intermediate file: 'C:/Users/cltbld/AppData/Local/Temp\_CL_6070deebsy': No space left on device
12:08:16     INFO -  Unified_cpp_js_src24.cpp

Which is strange because there are 20+ gigs available on the machine.
(Assignee)

Comment 7

2 years ago
Oddly enough, it does seem like the instance had run out of space. Moving forward, to avoid using a larger initial drive, I am going to clean up the instance, increase the purge build target to between 30 and 35 gigs, and see how that works out.
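
For anyone repeating this check, here is a minimal Python sketch that reports free space on the drive holding the temp directory (where the C1088 error was raised) and compares it with a target. The helper names and the 35 gig figure used as a threshold are illustrative assumptions; this is not how the purge builds tooling itself works.

```python
import shutil
import tempfile

GIB = 1024 ** 3  # bytes per gibibyte


def free_gib(path):
    """Return the free space, in GiB, on the volume that contains `path`."""
    return shutil.disk_usage(path).free / GIB


def has_headroom(path, target_gib):
    """Return True if the volume holding `path` has at least `target_gib` free."""
    return free_gib(path) >= target_gib


# Hypothetical threshold, borrowed from the purge build target discussed in this bug.
TARGET_GIB = 35

temp_dir = tempfile.gettempdir()
print("temp dir: %s" % temp_dir)
print("free space: %.1f GiB" % free_gib(temp_dir))
print("meets %d GiB target: %s" % (TARGET_GIB, has_headroom(temp_dir, TARGET_GIB)))
```

Running the same check against the build directory (c:/builds/moz2_slave in the logs above) would show whether the temp directory and the build directory share a volume's free space.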
(Assignee)

Comment 8

2 years ago
The AMI now has 50+ gigs of free space, and the purge build target is now 35 gigs. I am now spinning up b-2008-spot-002 through 006.

AMI ID
b-2008-VAN-BASE-SDK-VS-postPUPPET-GOODBASE-VER12-2015-11-16 (ami-fab5cf90)
(Assignee)

Comment 9

2 years ago
With the new AMI there have been no failures on 003, 004, or 005. The other 2 did have failing builds.

There were 3 failures on b-2008-spot-002, all with this error:

14:53:56     INFO -  nsWin32Locale.cpp
14:53:56     INFO -  python2.7 c:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/sccache/sccache.py cl -FoUnified_cpp_ipc_glue0.obj -c -I../../dist/stl_wrappers  -DWIN32_LEAN_AND_MEAN -D_WIN32 -DWIN32 -D_CRT_RAND_S -DCERT_CHAIN_PARA_HAS_EXTRA_FIELDS -DOS_WIN=1 -D_UNICODE -DCHROMIUM_BUILD -DU_STATIC_IMPLEMENTATION -DUNICODE -D_WINDOWS -D_SECURE_ATL -DCOMPILER_MSVC -DMOZ_CHILD_PROCESS_NAME='"plugin-container.exe"' -DMOZ_CHILD_PROCESS_NAME_PIE='""' -DMOZ_CHILD_PROCESS_BUNDLE='"plugin-container.app/Contents/MacOS/"' -DDLL_PREFIX='""' -DDLL_SUFFIX='".dll"' -DSTATIC_EXPORTABLE_JS_API -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/ipc/glue -I. -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/caps -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/dom/broadcastchannel -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/dom/indexedDB -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/dom/workers -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/media/webrtc/trunk -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/xpcom/build -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/toolkit/xre -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/xpcom/threads -I../../ipc/ipdl/_ipdlheaders -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/ipc/chromium/src -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/ipc/glue -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/toolkit/crashreporter -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/security/sandbox/chromium -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/security/sandbox/chromium-shim -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/security/sandbox/win/src/sandboxbroker -I../../dist/include  
-Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/obj-firefox/dist/include/nspr -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/obj-firefox/dist/include/nss        -MD           -FI ../../dist/include/mozilla-config.h -DMOZILLA_CLIENT -deps.deps/Unified_cpp_ipc_glue0.obj.pp  -TP -nologo -D_HAS_EXCEPTIONS=0 -W3 -Gy -arch:IA32 -FS -wd4251 -wd4244 -wd4267 -wd4345 -wd4351 -wd4800 -wd4819 -we4553 -GR-  -DNDEBUG -DTRIMMED -Z7 -UDEBUG -DNDEBUG -O1 -Oi -Oy- -WX     c:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/obj-firefox/ipc/glue/Unified_cpp_ipc_glue0.cpp
14:53:56     INFO -     Creating library ../lib/icuin.lib and object ../lib/icuin.exp
14:53:57     INFO -  CVTRES : fatal error CVT1107: 'c:\builds\moz2_slave\m-in-w32-000000000000000000000\build\src\obj-firefox\intl\icu\target\i18n\utf16collationiterator.o' is corrupt
14:53:57     INFO -  LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt
14:53:57     INFO -  Makefile:185: recipe for target '../lib/icuin55.dll' failed
14:53:57     INFO -  mozmake.EXE[7]: *** [../lib/icuin55.dll] Error 99


b-2008-spot-006 had a build with warnings: 

http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-beta%20build/builds/0 . 

It failed on sendchange:
 Unable to successfully run ['buildbot', 'sendchange', '--master', 'buildbot-master81.build.mozilla.org:9301', '--username', 'sendchange-unittest', '--branch', 'mozilla-beta-win32-pgo-unittest', '--revision', 'c66289e84c50', '--comments', 'Bug 1217047 - try harder in IsContractIDRegistered to return a reasonable answer_ r=bsmedberg,f=yury a=lizzard', '--property', 'buildid:20151116124336', '--property', 'pgo_build:True', '--property', 'builduid:65a1d021a6db4a43a6652e11a0d549fe', 'http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-beta-win32/1447706616/firefox-43.0.en-US.win32.zip', 'http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-beta-win32/1447706616/firefox-43.0.en-US.win32.web-platform.tests.zip'] after 5 attempts

However, sendchange did work when the command was run manually. A re-run of the build is currently pending. I am going to wait on that before diving into this deeper.
(Assignee)

Comment 10

2 years ago
Q, I think you have dealt with this in the past:

> 14:53:57     INFO -  CVTRES : fatal error CVT1107:
> 'c:\builds\moz2_slave\m-in-w32-000000000000000000000\build\src\obj-
> firefox\intl\icu\target\i18n\utf16collationiterator.o' is corrupt
> 14:53:57     INFO -  LINK : fatal error LNK1123: failure during conversion
> to COFF: file invalid or corrupt
> 14:53:57     INFO -  Makefile:185: recipe for target '../lib/icuin55.dll'
> failed

I thought I had this worked around with this: http://hg.mozilla.org/build/puppet/file/tip/modules/tweaks/manifests/vs_2013_lnk.pp

Is there a piece I am missing?
Flags: needinfo?(q)

Comment 11

You're missing what that failure is saying: the run before those three, 002 disconnected in the middle of a build, in fact, in the middle of building ICU. You don't want that to happen, because our build system sucks, but not as badly as ICU's build system sucks. The existing solution for that, depending on who noticed it and what time of day it was, would be to either disable the slave and have releng either remove the objdir or (more likely since it's one button click) reimage the slave, or, use the clobberer to remove the m-i opt objdir from every slave. Since I saw this one and have a deep fondness for terminating things that are making a mess of production jobs, I gave it a touch of the Terminate button on https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=b-2008-spot&name=b-2008-spot-002, and then began feeling a bit guilty that it might not be as automatically regenerated as production AWS instances are.
(Assignee)

Updated

2 years ago
Flags: needinfo?(q)
(Assignee)

Comment 12

2 years ago
Ah OK. I will blow that instance away and recreate it in the am.
(Assignee)

Comment 13

2 years ago
I am going to add yasm 1.3 and new GLS key to this testing image.
(Assignee)

Comment 14

2 years ago
New AMI: b-2008-VAN-BASE-SDK-VS-postPUPPET-GOODBASE-VER13-2015-11-17 (ami-8d2560e7)
The b-2008-spot stuff in cloud-tools is all ready to go. It is just a case of enabling instances in slavealloc, and they should spin up within 30 to 60 minutes depending on load. You may have to disable some ix instances if there isn't enough load.

You have to address the individual instances in slavealloc (or go direct to mysql), because the group pages will only populate after some builds have completed, eg:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-spot-101
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-spot-102
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-spot-103
etc
Note that spot instances:
001 to 100 are in us-east-1
101 to 200 are in us-west-2

I would enable instances in both regions to get the most thorough test results.
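
The numbering convention above can be sketched as a small Python helper. The function name `spot_region` and the strict three-digit hostname pattern are assumptions for illustration; only the two number ranges come from this comment.

```python
import re


def spot_region(hostname):
    """Map a b-2008-spot hostname to its AWS region using the ranges noted above."""
    match = re.fullmatch(r"b-2008-spot-(\d{3})", hostname)
    if match is None:
        raise ValueError("not a b-2008-spot hostname: %r" % hostname)
    number = int(match.group(1))
    if 1 <= number <= 100:
        return "us-east-1"
    if 101 <= number <= 200:
        return "us-west-2"
    # Numbers outside 001-200 are not covered by the convention described here.
    raise ValueError("instance number out of known ranges: %d" % number)


for name in ("b-2008-spot-002", "b-2008-spot-101"):
    print(name, spot_region(name))
```

A helper like this makes it easy to confirm that a test pool includes instances from both regions before enabling them.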
(Assignee)

Comment 17

2 years ago
For the time being I am holding off on the additional rollout. I am seeing this error pop up on multiple builds on multiple instances this morning:

14:33:11     INFO -  mozmake.EXE[2]: Leaving directory 'c:/builds/moz2_slave/date-w32-000000000000000000000/build/src/obj-firefox/browser/locales'
14:33:11     INFO -  mozmake.EXE: Leaving directory 'c:/builds/moz2_slave/date-w32-000000000000000000000/build/src/obj-firefox'
15:53:11     INFO - Automation Error: mozprocess timed out after 4800 seconds running ['c:\\mozilla-build\\python27\\python.exe', 'mach', '--log-no-times', 'build', '-v']
15:53:11    ERROR - timed out after 4800 seconds of no output
15:53:11    ERROR - Return code: 572
15:53:11  WARNING - setting return code to 2
15:53:11    FATAL - 'mach build' did not run successfully. Please check log for errors.
15:53:11    FATAL - Running post_fatal callback...

http://buildbot-master71.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20date%20build/builds/0
http://buildbot-master77.bb.releng.use1.mozilla.com:8001/builders/WINNT%206.1%20x86-64%20mozilla-inbound%20build/builds/3
http://buildbot-master94.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-inbound%20leak%20test%20build/builds/1
http://buildbot-master94.bb.releng.use1.mozilla.com:8001/builders/Win32%20Mulet%20mozilla-central%20nightly/builds/0
http://buildbot-master71.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-inbound%20pgo-build/builds/0
http://buildbot-master71.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20date%20build/builds/0
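
To locate where a build went silent, a minimal Python sketch like the one below can scan a log for the largest gap between consecutive timestamped lines. It assumes every relevant line starts with an HH:MM:SS stamp, as in the excerpt above, and it does not handle logs that cross midnight. The function name and the abbreviated sample lines are illustrative.

```python
from datetime import datetime, timedelta


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the biggest pause in the log."""
    previous_time = None
    previous_line = None
    best = (timedelta(0), None, None)
    for line in lines:
        try:
            # The first 8 characters are expected to be an HH:MM:SS timestamp.
            stamp = datetime.strptime(line[:8], "%H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        if previous_time is not None:
            gap = stamp - previous_time
            if gap > best[0]:
                best = (gap, previous_line, line)
        previous_time, previous_line = stamp, line
    return best


# Abbreviated lines from the log excerpt in this comment.
sample = [
    "14:33:11     INFO -  mozmake.EXE[2]: Leaving directory",
    "14:33:11     INFO -  mozmake.EXE: Leaving directory",
    "15:53:11     INFO - Automation Error: mozprocess timed out after 4800 seconds",
]
gap, before, after = largest_gap(sample)
print(gap.total_seconds(), "seconds of silence after:", before)
```

On the sample, the gap between 14:33:11 and 15:53:11 is 80 minutes, which matches the 4800 second no-output timeout reported by mozprocess.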
(Assignee)

Comment 18

2 years ago
Enabled share hg extension and recaptured the image: 

b-2008-VAN-BASE-SDK-VS-postPUPPET-GOODBASE-VER14-2015-11-19 (ami-0f5d1965)
(Assignee)

Comment 19

2 years ago
Disabled the Windows Defender service to hopefully cut down on some of the timeouts:

b-2008-VAN-BASE-SDK-VS-postPUPPET-GOODBASE-VER15-2015-11-20 (ami-cdeeaaa7)
(Assignee)

Comment 20

2 years ago
With the latest AMI, builds have been mostly green or failing on the known get mar issue. One build failed for another reason:

http://buildbot-master94.bb.releng.use1.mozilla.com:8001/builders/mozilla-release%20hg%20bundle/builds/0

downloading bundle https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/releases/mozilla-release/c7aea035aad1b34bbd492f5cd2236ad6e318c5a5.stream-legacy.hg
streaming all changes
215964 files to transfer, 1.54 GB of data
transferred 1.54 GB in 236.7 seconds (6.66 MB/sec)
finishing applying bundle; pulling
searching for changes
no changes found
updating to branch default
125848 files updated, 0 files merged, 0 files removed, 0 files unresolved
291344 changesets found
scp: /home/ftp/pub/firefox/bundles/mozilla-release.hg.upload: No such file or directory
program finished with exit code 1
(Assignee)

Comment 21

2 years ago
Mozilla-beta builds are failing on sendchange. Looking into this, I found that the PATH environment variable Buildbot is using does not match the instance's default value and is malformed (missing \ in various locations):

PATH=C:\mozilla-buildpython27;C:\mozilla-buildbuildbotve\scripts;C:\mozilla-build\msys\local\bin;c:\mozilla-build\wget;c:\mozilla-build\7zip;c:\mozilla-build\blat261\full;c:\mozilla-build\python;c:\mozilla-build\svn-win32-1.6.3\bin;c:\mozilla-build\upx203w;c:\mozilla-build\emacs-24.3\bin;c:\mozilla-build\info-zip;c:\mozilla-build\nsis-2.46u;c:\mozilla-build\nsis-3.0a2;c:\mozilla-build\wix-351728;c:\mozilla-build\hg;c:\mozilla-build\python\Scripts;c:\mozilla-build\kdiff3;c:\mozilla-build\yasm;c:\mozilla-build\mozmake;.;C:\mozilla-build\msys\local\bin;C:\mozilla-build\msys\mingw\bin;C:\mozilla-build\msys\bin;c:\Program Files (x86)\Puppet Labs\Puppet\puppet\bin;c:\Program Files (x86)\Puppet Labs\Puppet\facter\bin;c:\Program Files (x86)\Puppet Labs\Puppet\hiera\bin;c:\Program Files (x86)\Puppet Labs\Puppet\bin;c:\Program Files (x86)\Puppet Labs\Puppet\sys\ruby\bin;c:\Program Files (x86)\Puppet Labs\Puppet\sys\tools\bin;c:\Windows\system32;c:\Windows;c:\Windows\System32\Wbem;c:\Windows\System32\WindowsPowerShell\v1.0\;c:\Windows\System32\WindowsPowerShell\v1.0\;c:\Program Files\Amazon\cfn-bootstrap\;c:\Program Files (x86)\Windows Kits\8.0\Windows Performance Toolkit\;c:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\;c:\Program Files\Microsoft SQL Server\100\Tools\Binn\;c:\Program Files\Microsoft SQL Server\100\DTS\Binn\;c:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\;c:\Program Files\Microsoft SQL Server\110\Tools\Binn\;c:\Program Files (x86)\Microsoft SDKs\TypeScript\1.0\;c:\Program Files (x86)\Puppet Labs\Puppet\bin;c:\opt\runner;c:\mozilla-build\buildbotve;c:\mozilla-build\python27;C:\mozilla-build\msys\mingw\bin;C:\mozilla-build\msys\bin;C:\mozilla-build\msys\local\bin;.;C:\mozilla-build\msys\local\bin;c:\mozilla-build\moztools-x64\bin;c:\mozilla-build\vim\vim72

I am not sure where it is getting this value or why it differs from the IX machines.
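
One way to spot this kind of damage is a minimal Python sketch that flags PATH entries beginning with a known root directory but missing the backslash after it. The function name is hypothetical, and the default root of C:\mozilla-build is an assumption based on the entries shown above.

```python
def find_malformed_entries(path_value, root=r"C:\mozilla-build"):
    """Return PATH entries that start with `root` but lack a separator after it."""
    malformed = []
    root_lower = root.lower()
    for entry in path_value.split(";"):
        # Windows paths are case-insensitive, so compare in lower case.
        if not entry.lower().startswith(root_lower):
            continue
        rest = entry[len(root):]
        # A well-formed entry is the root itself or the root followed by a backslash.
        if rest and not rest.startswith("\\"):
            malformed.append(entry)
    return malformed


# The first few entries of the PATH value quoted in this comment.
sample = (
    r"C:\mozilla-buildpython27;"
    r"C:\mozilla-buildbuildbotve\scripts;"
    r"C:\mozilla-build\msys\local\bin;"
    r"c:\mozilla-build\wget"
)
for entry in find_malformed_entries(sample):
    print("malformed:", entry)
```

On the sample, only the first two entries are reported, which are the two directories Buildbot cannot resolve.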
(Assignee)

Comment 22

2 years ago
Grenade has hacked around this for now in the userdata:

+  Create-SymbolicLink -link 'C:\mozilla-buildpython27' -target 'C:\mozilla-build\python27'
+  Create-SymbolicLink -link 'C:\mozilla-buildbuildbotve' -target 'C:\mozilla-build\buildbotve'

Hopefully that will work as a temporary workaround until we can figure out how Buildbot is getting the value mentioned above.
(Assignee)

Updated

2 years ago
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED