Closed Bug 1222556 Opened 10 years ago Closed 10 years ago

investigate build failures on b-2008 instances

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: markco)

Details

We have 10 completed non-try builds. 5 were successful and 5 failed. 3 of the failures were also busted on other builds. I have not been able to determine the reason for failure on the other 2.

http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%206.1%20x86-64%20mozilla-inbound%20leak%20test%20build/builds/2
Branch integration/mozilla-inbound
Revision 8d20513ae79ed8bf3366536f318935bdab1ed9f2
Got Revision 8d20513ae79e
Successful on other platforms
Undetermined reasons for failure.

http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/TB%20WINNT%205.2%20comm-central%20build/builds/0
Got Revision 7a674f9ee355
Busted on all platforms
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=7a674f9ee355

http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-inbound%20build/builds/2
Branch integration/mozilla-inbound
Revision 6d7d90a28e057220e59988c6fca3ed5f20bacea3
Got Revision 6d7d90a28e05
All other builds successful
Undetermined reasons for failure.

http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/TB%20WINNT%205.2%20comm-central%20build/builds/0
Branch comm-central
Revision 7a674f9ee35503b88660df41e4616967b72ad765
Got Revision 7a674f9ee355
Busted on all platforms
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=7a674f9ee355

http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-inbound%20build/builds/0
Branch integration/mozilla-inbound
Revision 8d0a776c5d1e7ad6929e6a5e6afeff75b67c3886
Got Revision 8d0a776c5d1e
Busted on XP debug as well
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8d0a776c5d1e
Assignee: relops → mcornmesser
All the failed builds from the last 24 hours have been related to a known issue with the l10n dep builds. However, there have not been enough non-l10n builds to gauge whether we are safe to significantly expand non-try building in AWS. I am going to move forward with a 3-machine test pool, b-2008-spot-002 through 004.
PGO builds are timing out:
http://buildbot-master77.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-inbound%20pgo-build/builds/0

16:00:01 INFO - mozmake.EXE[3]: Leaving directory 'c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox'
16:00:01 INFO - set -e; \
16:00:01 INFO - for mkfile in build/sccache.mk; do \
16:00:01 INFO - c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/mozmake.EXE -f c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/$mkfile postflight_all TOPSRCDIR=c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src OBJDIR=c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox MOZ_OBJDIR=c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox; \
16:00:01 INFO - done
16:00:01 INFO - mozmake.EXE[3]: Entering directory 'c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src'
16:00:01 INFO - # Terminate sccache server. This prints sccache stats.
16:00:01 INFO - python2.7 c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/sccache/sccache.py 2>&1 | gzip > c:/builds/moz2_slave/m-in-w32-pgo-00000000000000000/build/src/obj-firefox/dist/sccache.log.gz
command timed out: 19800 seconds elapsed running ['c:/mozilla-build/python27/python', '-u', 'scripts/scripts/fx_desktop_build.py', '--config', 'builds/releng_base_windows_32_builds.py', '--config', 'balrog/production.py', '--branch', 'mozilla-inbound', '--build-pool', 'production', '--enable-pgo'], attempting to kill
SIGKILL failed to kill process
using fake rc=-1
program finished with exit code -1
remoteFailed: [Failure instance: Traceback from remote host -- Traceback (most recent call last):
Failure: exceptions.RuntimeError: SIGKILL failed to kill process
From an IRC conversation, this may have been caused by insufficient RAM. I am moving the instance type to c3.2xlarge, which will increase the RAM from 7.5 to 15 gigs. The display may also be an issue for the PGO builds.
In the last 2 days there have still been many l10n builds failing on the get mar step, but several l10n builds on hardware are failing with the same error at the same place.

There was a sendchange failure on this build:
http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%206.1%20x86-64%20mozilla-beta%20leak%20test%20build/builds/0
It is suspected this was due to a malformed directory in the PATH. I have updated the path to include C:\mozilla-build\buildbotve\scripts and recaptured the base image: b-2008-VAN-BASE-SDK-VS-postPUPPET-GOODBASE-VER9-2015-11-13 (ami-563b403c). There are now 4 instances up based on this image: b-2008-spot-002 and 004 through 006.

The most recent PGO build failed despite changing the instance type to c3.2xlarge:
http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%206.1%20x86-64%20mozilla-beta%20leak%20test%20build/builds/0

Outside of what has been mentioned above, the rest of the builds are mainly coming up green.
Correction: The most recent PGO build failed despite changing the instance type to c3.2xlarge:
http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20fx-team%20pgo-build/builds/0
Catlee pointed out in the log from comment 5:

12:08:16 INFO - Warning: C4146 in c:\builds\moz2_slave\fx-team-w32-pgo-00000000000000\build\src\js\src\jit\x86-shared\AtomicOperations-x86-shared.h: unary minus operator applied to unsigned type, result still unsigned
12:08:16 INFO - c:\builds\moz2_slave\fx-team-w32-pgo-00000000000000\build\src\js\src\jit/x86-shared/AtomicOperations-x86-shared.h(522) : warning C4146: unary minus operator applied to unsigned type, result still unsigned
12:08:16 INFO - c:/builds/moz2_slave/fx-team-w32-pgo-00000000000000/build/src/js/src/jsgc.cpp(4801) : fatal error C1088: Cannot flush compiler intermediate file: 'C:/Users/cltbld/AppData/Local/Temp\_CL_6070deebsy': No space left on device
12:08:16 INFO - Unified_cpp_js_src24.cpp

Which is strange because there are 20+ gigs available on the machine.
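A fatal error like the C1088 above is easy to miss among the surrounding warnings. The following is a minimal sketch, not part of any releng tooling, of how a build log could be scanned for MSVC-style "fatal error" lines. The function name find_fatal_errors and the regular expression are assumptions made for illustration; the sample line is the one quoted in this comment.

```python
import re

# Assumed pattern: MSVC tools report fatal errors as "fatal error <CODE>: <message>",
# e.g. C1088 (compiler), LNK1123 (linker), CVT1107 (cvtres).
FATAL_RE = re.compile(r"fatal error ([A-Z]+\d+): (.*)")


def find_fatal_errors(log_lines):
    """Return a list of (code, message) pairs, one per log line with a fatal error."""
    hits = []
    for line in log_lines:
        match = FATAL_RE.search(line)
        if match:
            hits.append((match.group(1), match.group(2).strip()))
    return hits


# Two lines taken from the log excerpt quoted in this comment.
sample_log = [
    r"12:08:16 INFO - c:/builds/moz2_slave/fx-team-w32-pgo-00000000000000/build/src/js/src/jsgc.cpp(4801) : fatal error C1088: Cannot flush compiler intermediate file: 'C:/Users/cltbld/AppData/Local/Temp\_CL_6070deebsy': No space left on device",
    "12:08:16 INFO - Unified_cpp_js_src24.cpp",
]

for code, message in find_fatal_errors(sample_log):
    print(code, "-", message)
```

The same pattern would also pick up linker and resource-conversion failures that use the same "fatal error" wording, which makes it a quick first pass over a long log.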
Oddly enough, it does seem like the instance had run out of space. Moving forward from here, in an attempt to avoid using a larger initial drive, I am going to clean up the instance, increase the purge build target to 30 to 35, and see how that works out.
The AMI now has 50+ gigs of free space, and the purge build target is now 35 gigs. I am now spinning up b-2008-spot 002 through 006. AMI ID b-2008-VAN-BASE-SDK-VS-postPUPPET-GOODBASE-VER12-2015-11-16 (ami-fab5cf90)
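For a quick sanity check of an instance against the 35 gig target, something like the following could be run on the machine. This is only a sketch: it is not how the purge builds step itself works, the names needs_purge and check_drive are made up for illustration, and it assumes "gigs" means GiB.

```python
import shutil

GIB = 1024 ** 3  # assumption: "gigs" in this bug are treated as GiB


def needs_purge(free_bytes, target_gib=35):
    """Return True when free space is below the purge target."""
    return free_bytes < target_gib * GIB


def check_drive(path, target_gib=35):
    """Return (free_bytes, needs_purge) for the drive that contains path."""
    usage = shutil.disk_usage(path)
    return usage.free, needs_purge(usage.free, target_gib)


if __name__ == "__main__":
    # On a builder this would be the build drive, e.g. "C:\\"; "." keeps the sketch portable.
    free, purge = check_drive(".")
    print("free: %.1f GiB, below target: %s" % (free / GIB, purge))
```

With the numbers from this bug, a machine with 20 gigs free would be flagged and one with 50 gigs free would not.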
With the new AMI there have been no failures on 003, 004, or 005. The other 2 did have failing builds.

There were 3 failures on b-2008-spot-002, all with this error:

14:53:56 INFO - nsWin32Locale.cpp
14:53:56 INFO - python2.7 c:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/sccache/sccache.py cl -FoUnified_cpp_ipc_glue0.obj -c -I../../dist/stl_wrappers -DWIN32_LEAN_AND_MEAN -D_WIN32 -DWIN32 -D_CRT_RAND_S -DCERT_CHAIN_PARA_HAS_EXTRA_FIELDS -DOS_WIN=1 -D_UNICODE -DCHROMIUM_BUILD -DU_STATIC_IMPLEMENTATION -DUNICODE -D_WINDOWS -D_SECURE_ATL -DCOMPILER_MSVC -DMOZ_CHILD_PROCESS_NAME='"plugin-container.exe"' -DMOZ_CHILD_PROCESS_NAME_PIE='""' -DMOZ_CHILD_PROCESS_BUNDLE='"plugin-container.app/Contents/MacOS/"' -DDLL_PREFIX='""' -DDLL_SUFFIX='".dll"' -DSTATIC_EXPORTABLE_JS_API -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/ipc/glue -I. -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/caps -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/dom/broadcastchannel -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/dom/indexedDB -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/dom/workers -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/media/webrtc/trunk -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/xpcom/build -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/toolkit/xre -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/xpcom/threads -I../../ipc/ipdl/_ipdlheaders -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/ipc/chromium/src -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/ipc/glue -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/toolkit/crashreporter -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/security/sandbox/chromium -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/security/sandbox/chromium-shim -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/security/sandbox/win/src/sandboxbroker -I../../dist/include -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/obj-firefox/dist/include/nspr -Ic:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/obj-firefox/dist/include/nss -MD -FI ../../dist/include/mozilla-config.h -DMOZILLA_CLIENT -deps.deps/Unified_cpp_ipc_glue0.obj.pp -TP -nologo -D_HAS_EXCEPTIONS=0 -W3 -Gy -arch:IA32 -FS -wd4251 -wd4244 -wd4267 -wd4345 -wd4351 -wd4800 -wd4819 -we4553 -GR- -DNDEBUG -DTRIMMED -Z7 -UDEBUG -DNDEBUG -O1 -Oi -Oy- -WX c:/builds/moz2_slave/m-in-w32-000000000000000000000/build/src/obj-firefox/ipc/glue/Unified_cpp_ipc_glue0.cpp
14:53:56 INFO - Creating library ../lib/icuin.lib and object ../lib/icuin.exp
14:53:57 INFO - CVTRES : fatal error CVT1107: 'c:\builds\moz2_slave\m-in-w32-000000000000000000000\build\src\obj-firefox\intl\icu\target\i18n\utf16collationiterator.o' is corrupt
14:53:57 INFO - LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt
14:53:57 INFO - Makefile:185: recipe for target '../lib/icuin55.dll' failed
14:53:57 INFO - mozmake.EXE[7]: *** [../lib/icuin55.dll] Error 99

b-2008-spot-006 had a build with warnings:
http://buildbot-master70.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-beta%20build/builds/0
It failed on sendchange:

Unable to successfully run ['buildbot', 'sendchange', '--master', 'buildbot-master81.build.mozilla.org:9301', '--username', 'sendchange-unittest', '--branch', 'mozilla-beta-win32-pgo-unittest', '--revision', 'c66289e84c50', '--comments', 'Bug 1217047 - try harder in IsContractIDRegistered to return a reasonable answer_ r=bsmedberg,f=yury a=lizzard', '--property', 'buildid:20151116124336', '--property', 'pgo_build:True', '--property', 'builduid:65a1d021a6db4a43a6652e11a0d549fe', 'http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-beta-win32/1447706616/firefox-43.0.en-US.win32.zip', 'http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-beta-win32/1447706616/firefox-43.0.en-US.win32.web-platform.tests.zip'] after 5 attempts

However, sendchange did work when the command was run manually. A re-run of the build is currently pending. I am going to wait on that before diving into this deeper.
Q, I think you have dealt with this in the past:

> 14:53:57 INFO - CVTRES : fatal error CVT1107: 'c:\builds\moz2_slave\m-in-w32-000000000000000000000\build\src\obj-firefox\intl\icu\target\i18n\utf16collationiterator.o' is corrupt
> 14:53:57 INFO - LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt
> 14:53:57 INFO - Makefile:185: recipe for target '../lib/icuin55.dll' failed

I thought I had this worked around with this:
http://hg.mozilla.org/build/puppet/file/tip/modules/tweaks/manifests/vs_2013_lnk.pp

Is there a piece I am missing?
Flags: needinfo?(q)
You're missing what that failure is saying: the run before those three, 002 disconnected in the middle of a build, in fact, in the middle of building ICU. You don't want that to happen, because our build system sucks, but not as badly as ICU's build system sucks. The existing solution for that, depending on who noticed it and what time of day it was, would be to either disable the slave and have releng either remove the objdir or (more likely since it's one button click) reimage the slave, or, use the clobberer to remove the m-i opt objdir from every slave. Since I saw this one and have a deep fondness for terminating things that are making a mess of production jobs, I gave it a touch of the Terminate button on https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=build&type=b-2008-spot&name=b-2008-spot-002, and then began feeling a bit guilty that it might not be as automatically regenerated as production AWS instances are.
Flags: needinfo?(q)
Ah OK. I will blow that instance away and recreate it in the am.
I am going to add yasm 1.3 and new GLS key to this testing image.
New AMI: b-2008-VAN-BASE-SDK-VS-postPUPPET-GOODBASE-VER13-2015-11-17 (ami-8d2560e7)
The b-2008-spot stuff in cloud-tools is all ready to go. It is just a case of enabling instances in slavealloc, and they should spin up within 30 to 60 minutes depending on load. You may have to disable some ix instances if there isn't enough load. You have to address the individual instances in slavealloc (or go direct to mysql), because the group pages will only populate after some builds have completed, e.g.:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-spot-101
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-spot-102
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=b-2008-spot-103
etc.
Note that for spot instances:
001 to 100 are in us-east-1
101 to 200 are in us-west-2
I would enable instances in both regions to get the most thorough test results.
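The numbering convention above can be written down as a small lookup, which may help when picking instances from both regions. This is a sketch only; the function name spot_region is hypothetical and the name pattern is inferred from the instance names used in this bug.

```python
import re


def spot_region(name):
    """Map a b-2008-spot instance name to its AWS region by instance number."""
    match = re.fullmatch(r"b-2008-spot-(\d{3})", name)
    if match is None:
        raise ValueError("not a b-2008-spot instance name: %r" % name)
    number = int(match.group(1))
    if 1 <= number <= 100:
        return "us-east-1"
    if 101 <= number <= 200:
        return "us-west-2"
    raise ValueError("instance number outside 001-200: %r" % name)


for host in ("b-2008-spot-002", "b-2008-spot-101"):
    print(host, spot_region(host))
```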
For the time being I am holding off on the additional rollout. I am seeing this error pop up on multiple builds on multiple instances this morning:

14:33:11 INFO - mozmake.EXE[2]: Leaving directory 'c:/builds/moz2_slave/date-w32-000000000000000000000/build/src/obj-firefox/browser/locales'
14:33:11 INFO - mozmake.EXE: Leaving directory 'c:/builds/moz2_slave/date-w32-000000000000000000000/build/src/obj-firefox'
15:53:11 INFO - Automation Error: mozprocess timed out after 4800 seconds running ['c:\\mozilla-build\\python27\\python.exe', 'mach', '--log-no-times', 'build', '-v']
15:53:11 ERROR - timed out after 4800 seconds of no output
15:53:11 ERROR - Return code: 572
15:53:11 WARNING - setting return code to 2
15:53:11 FATAL - 'mach build' did not run successfully. Please check log for errors.
15:53:11 FATAL - Running post_fatal callback...

http://buildbot-master71.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20date%20build/builds/0
http://buildbot-master77.bb.releng.use1.mozilla.com:8001/builders/WINNT%206.1%20x86-64%20mozilla-inbound%20build/builds/3
http://buildbot-master94.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-inbound%20leak%20test%20build/builds/1
http://buildbot-master94.bb.releng.use1.mozilla.com:8001/builders/Win32%20Mulet%20mozilla-central%20nightly/builds/0
http://buildbot-master71.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20mozilla-inbound%20pgo-build/builds/0
http://buildbot-master71.bb.releng.use1.mozilla.com:8001/builders/WINNT%205.2%20date%20build/builds/0
Enabled share hg extension and recaptured the image: b-2008-VAN-BASE-SDK-VS-postPUPPET-GOODBASE-VER14-2015-11-19 (ami-0f5d1965)
Disabled the Windows Defender service to hopefully cut down on some of the timeouts: b-2008-VAN-BASE-SDK-VS-postPUPPET-GOODBASE-VER15-2015-11-20 (ami-cdeeaaa7)
With the latest AMI, builds have been mostly green or failing on the known get mar issue. There is one build that failed for another reason:
http://buildbot-master94.bb.releng.use1.mozilla.com:8001/builders/mozilla-release%20hg%20bundle/builds/0

downloading bundle https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/releases/mozilla-release/c7aea035aad1b34bbd492f5cd2236ad6e318c5a5.stream-legacy.hg
streaming all changes
215964 files to transfer, 1.54 GB of data
transferred 1.54 GB in 236.7 seconds (6.66 MB/sec)
finishing applying bundle; pulling
searching for changes
no changes found
updating to branch default
125848 files updated, 0 files merged, 0 files removed, 0 files unresolved
291344 changesets found
scp: /home/ftp/pub/firefox/bundles/mozilla-release.hg.upload: No such file or directory
program finished with exit code 1
Mozilla-beta builds are failing on sendchange. Looking into this, I found that the PATH environment variable Buildbot is using does not match the instance's default value and is malformed (missing \ in various locations):

PATH=C:\mozilla-buildpython27;C:\mozilla-buildbuildbotve\scripts;C:\mozilla-build\msys\local\bin;c:\mozilla-build\wget;c:\mozilla-build\7zip;c:\mozilla-build\blat261\full;c:\mozilla-build\python;c:\mozilla-build\svn-win32-1.6.3\bin;c:\mozilla-build\upx203w;c:\mozilla-build\emacs-24.3\bin;c:\mozilla-build\info-zip;c:\mozilla-build\nsis-2.46u;c:\mozilla-build\nsis-3.0a2;c:\mozilla-build\wix-351728;c:\mozilla-build\hg;c:\mozilla-build\python\Scripts;c:\mozilla-build\kdiff3;c:\mozilla-build\yasm;c:\mozilla-build\mozmake;.;C:\mozilla-build\msys\local\bin;C:\mozilla-build\msys\mingw\bin;C:\mozilla-build\msys\bin;c:\Program Files (x86)\Puppet Labs\Puppet\puppet\bin;c:\Program Files (x86)\Puppet Labs\Puppet\facter\bin;c:\Program Files (x86)\Puppet Labs\Puppet\hiera\bin;c:\Program Files (x86)\Puppet Labs\Puppet\bin;c:\Program Files (x86)\Puppet Labs\Puppet\sys\ruby\bin;c:\Program Files (x86)\Puppet Labs\Puppet\sys\tools\bin;c:\Windows\system32;c:\Windows;c:\Windows\System32\Wbem;c:\Windows\System32\WindowsPowerShell\v1.0\;c:\Windows\System32\WindowsPowerShell\v1.0\;c:\Program Files\Amazon\cfn-bootstrap\;c:\Program Files (x86)\Windows Kits\8.0\Windows Performance Toolkit\;c:\Program Files (x86)\Microsoft SQL Server\100\Tools\Binn\;c:\Program Files\Microsoft SQL Server\100\Tools\Binn\;c:\Program Files\Microsoft SQL Server\100\DTS\Binn\;c:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\;c:\Program Files\Microsoft SQL Server\110\Tools\Binn\;c:\Program Files (x86)\Microsoft SDKs\TypeScript\1.0\;c:\Program Files (x86)\Puppet Labs\Puppet\bin;c:\opt\runner;c:\mozilla-build\buildbotve;c:\mozilla-build\python27;C:\mozilla-build\msys\mingw\bin;C:\mozilla-build\msys\bin;C:\mozilla-build\msys\local\bin;.;C:\mozilla-build\msys\local\bin;c:\mozilla-build\moztools-x64\bin;c:\mozilla-build\vim\vim72

I am not sure where it is getting this value from or why it is different from the IX machines.
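Spotting the fused entries by eye in a PATH this long is error-prone. The sketch below flags entries that begin with C:\mozilla-build but have no path separator immediately after it, which is the shape of the two bad entries above. The function name find_fused_entries is made up for illustration, and the check only covers this one kind of damage; it does not explain where Buildbot gets the value.

```python
def find_fused_entries(path_value, root=r"C:\mozilla-build"):
    """Return PATH entries that start with root but lack a separator right after it."""
    suspects = []
    for entry in path_value.split(";"):
        # Windows paths are case-insensitive, so compare in lower case.
        if entry.lower().startswith(root.lower()) and len(entry) > len(root):
            if entry[len(root)] not in "\\/":
                suspects.append(entry)
    return suspects


# The first few entries of the PATH value quoted in this comment.
sample_path = (
    r"C:\mozilla-buildpython27;C:\mozilla-buildbuildbotve\scripts;"
    r"C:\mozilla-build\msys\local\bin;c:\mozilla-build\wget"
)

for bad in find_fused_entries(sample_path):
    print("suspect entry:", bad)
```

On a live instance the input would be the environment Buildbot actually runs with (for example os.environ["PATH"] from inside a build step) rather than the interactive shell's value, since the two differ here.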
Grenade has hacked around this for now in the userdata:

+ Create-SymbolicLink -link 'C:\mozilla-buildpython27' -target 'C:\mozilla-build\python27'
+ Create-SymbolicLink -link 'C:\mozilla-buildbuildbotve' -target 'C:\mozilla-build\buildbotve'

Hopefully that will work as a temporary workaround until we can figure out how Buildbot is getting the value mentioned above.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED