Closed Bug 779159 Opened 12 years ago Closed 12 years ago

mw32-ix-slave* connections to buildbot master are dying during linking

Categories

(Infrastructure & Operations Graveyard :: NetOps, task)

x86
Windows 7
task
Not set
critical

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: emorley, Unassigned)

References

Details

First build this occurred on was:
https://tbpl.mozilla.org/?tree=Mozilla-Esr10&onlyunstarred=1&rev=d07051929ca3
-> (mw32-ix-slave13) https://tbpl.mozilla.org/php/getParsedLog.php?id=13963505&tree=Mozilla-Esr10
-> (mw32-ix-slave15, free space clobber) https://tbpl.mozilla.org/php/getParsedLog.php?id=13970058&tree=Mozilla-Esr10
-> We then somehow got a green on the next push

That changeset doesn't seem like it could have broken the build, unless I'm missing something?
https://hg.mozilla.org/releases/mozilla-esr10/rev/d07051929ca3

All other builds since then have failed, even after clobbering.

Win Nightly also failed:
https://tbpl.mozilla.org/php/getParsedLog.php?id=14002091&tree=Mozilla-Esr10

All with:

{
make -C toolkit/library libs
make[6]: Entering directory `/e/builds/moz2_slave/m-esr10-w32/build/obj-firefox/toolkit/library'
d:/mozilla-build/python25/python2.5.exe /e/builds/moz2_slave/m-esr10-w32/build/config/pythonpath.py -I../../config /e/builds/moz2_slave/m-esr10-w32/build/config/expandlibs_exec.py --uselist -- d:/mozilla-build/python25/python2.5.exe /e/builds/moz2_slave/m-esr10-w32/build/build/link.py /e/builds/moz2_slave/m-esr10-w32/build/obj-firefox/toolkit/library/linker-vsize link -NOLOGO -DLL -OUT:xul.dll -PDB:xul.pdb -SUBSYSTEM:WINDOWS  dlldeps-xul.obj nsStaticXULComponents.obj nsDllMain.obj nsGFXDeps.obj dlldeps-zlib.obj nsUnicharUtils.obj nsBidiUtils.obj nsRDFResource.obj   ./module.res -LARGEADDRESSAWARE -NXCOMPAT -DYNAMICBASE -SAFESEH  -DEBUG -DEBUGTYPE:CV -DEBUG -OPT:REF -LTCG:PGUPDATE   -LIBPATH:../../dist/lib -NODEFAULTLIB:msvcrt -NODEFAULTLIB:msvcrtd -NODEFAULTLIB:msvcprt -NODEFAULTLIB:msvcprtd -DEFAULTLIB:mozcrt  ../../toolkit/xre/xulapp_s.lib  ../../staticlib/components/necko.lib ../../staticlib/components/uconv.lib ../../staticlib/components/i18n.lib ../../staticlib/components/chardet.lib ../../staticlib/components/jar50.lib ../../staticlib/components/startupcache.lib ../../staticlib/components/pref.lib ../../staticlib/components/htmlpars.lib ../../staticlib/components/imglib2.lib ../../staticlib/components/gkgfx.lib ../../staticlib/components/gklayout.lib ../../staticlib/components/docshell.lib ../../staticlib/components/embedcomponents.lib ../../staticlib/components/webbrwsr.lib ../../staticlib/components/nsappshell.lib ../../staticlib/components/txmgr.lib ../../staticlib/components/commandlines.lib ../../staticlib/components/toolkitcomps.lib ../../staticlib/components/pipboot.lib ../../staticlib/components/pipnss.lib ../../staticlib/components/appcomps.lib ../../staticlib/components/jsreflect.lib ../../staticlib/components/composer.lib ../../staticlib/components/jetpack_s.lib ../../staticlib/components/telemetry.lib ../../staticlib/components/jsdebugger.lib ../../staticlib/components/storagecomps.lib ../../staticlib/components/rdf.lib 
../../staticlib/components/windowds.lib ../../staticlib/components/jsctypes.lib ../../staticlib/components/jsperf.lib ../../staticlib/components/gkplugin.lib ../../staticlib/components/windowsproxy.lib ../../staticlib/components/jsd.lib ../../staticlib/components/autoconfig.lib ../../staticlib/components/auth.lib ../../staticlib/components/cookie.lib ../../staticlib/components/permissions.lib ../../staticlib/components/universalchardet.lib ../../staticlib/components/places.lib ../../staticlib/components/tkautocomplete.lib ../../staticlib/components/satchel.lib ../../staticlib/components/pippki.lib ../../staticlib/components/imgicon.lib ../../staticlib/components/gkwidget.lib ../../staticlib/components/accessibility.lib ../../staticlib/components/spellchecker.lib ../../staticlib/components/zipwriter.lib ../../staticlib/components/services-crypto.lib ../../staticlib/jsipc_s.lib ../../staticlib/domipc_s.lib ../../staticlib/domplugins_s.lib ../../staticlib/mozipc_s.lib ../../staticlib/mozipdlgen_s.lib ../../staticlib/ipcshell_s.lib ../../staticlib/gfx2d.lib ../../staticlib/gfxipc_s.lib ../../staticlib/hal_s.lib ../../staticlib/xpcom_core.lib ../../staticlib/ucvutil_s.lib ../../staticlib/chromium_s.lib ../../staticlib/mozreg_s.lib ../../staticlib/thebes.lib ../../staticlib/ycbcr.lib ../../staticlib/angle.lib   ../../media/libjpeg/jpeg3250.lib ../../media/libpng/png.lib ../../gfx/qcms/mozqcms.lib e:/builds/moz2_slave/m-esr10-w32/build/obj-firefox/dist/lib/mozjs.lib e:/builds/moz2_slave/m-esr10-w32/build/obj-firefox/dist/lib/crmf.lib         e:/builds/moz2_slave/m-esr10-w32/build/obj-firefox/dist/lib/smime3.lib         e:/builds/moz2_slave/m-esr10-w32/build/obj-firefox/dist/lib/ssl3.lib         e:/builds/moz2_slave/m-esr10-w32/build/obj-firefox/dist/lib/nss3.lib         e:/builds/moz2_slave/m-esr10-w32/build/obj-firefox/dist/lib/nssutil3.lib ../../gfx/cairo/cairo/src/mozcairo.lib  ../../gfx/cairo/libpixman/src/mozlibpixman.lib ../../gfx/harfbuzz/src/mozharfbuzz.lib 
../../gfx/ots/src/mozots.lib  ../../dist/lib/mozsqlite3.lib  ../../modules/zlib/src/mozz.lib   e:/builds/moz2_slave/m-esr10-w32/build/obj-firefox/dist/lib/nspr4.lib e:/builds/moz2_slave/m-esr10-w32/build/obj-firefox/dist/lib/plc4.lib e:/builds/moz2_slave/m-esr10-w32/build/obj-firefox/dist/lib/plds4.lib  ../../dist/lib/mozalloc.lib kernel32.lib user32.lib gdi32.lib winmm.lib wsock32.lib advapi32.lib shell32.lib ole32.lib uuid.lib version.lib winspool.lib comdlg32.lib imm32.lib winmm.lib wsock32.lib msimg32.lib shlwapi.lib psapi.lib ws2_32.lib dbghelp.lib wininet.lib  usp10.lib oleaut32.lib   
PGOMGR : warning PG0188: No .PGC files matching 'xul!*.pgc' were found.
warning C4743: 'const std::logic_error::`vftable'' has different size in 'e:\builds\moz2_slave\m-esr10-w32\build\toolkit\crashreporter\google-breakpad\src\common\windows\http_upload.cc' and 'e:\builds\moz2_slave\m-esr10-w32\build\toolkit\xre\nsWindowsDllBlocklist.cpp': 12 and 16 bytes
warning C4743: 'const std::length_error::`vftable'' has different size in 'e:\builds\moz2_slave\m-esr10-w32\build\toolkit\crashreporter\google-breakpad\src\common\windows\http_upload.cc' and 'e:\builds\moz2_slave\m-esr10-w32\build\toolkit\xre\nsWindowsDllBlocklist.cpp': 12 and 16 bytes
   Creating library xul.lib and object xul.exp
Generating code

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
}
"PGOMGR : warning PG0188: No .PGC files matching 'xul!*.pgc' were found." is not an error, it happens in every PGO build. We merge and delete the pgc files into the pgd file before linking, but the linker warns about that anyway.

The only thing I can think of is that it's not generating output, so buildbot is killing it? Not sure why we'd only see this on ESR, maybe because it's still using VC 2005?
Ah ok, I haven't every noticed that in a log, sorry.

https://tbpl.mozilla.org/php/getParsedLog.php?id=14006787&tree=Mozilla-Esr10
s/every/ever/
Nothing build-related seems to be an error in that log. That, on the other hand:

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

And I don't know of anything in the build system that uses Twisted.
Hi ho, hi ho, off to releng we go!

(Thank you for taking a look :-))
Component: Build Config → Release Engineering
Product: Core → mozilla.org
Summary: Permanent Win7 failure on esr10, with "PGOMGR : warning PG0188: No .PGC files matching 'xul!*.pgc' were found." → Permanent Win7 failure on esr10, with "Connection to the other side was lost in a non-clean fashion"
Version: Trunk → other
Tree closed; we've had 5 pushes without Windows 7 coverage.
Severity: critical → blocker
OS: All → Windows 7
Hardware: All → x86
At a glance, it seems highly unlikely that this is a configuration issue, but I'll look into it more deeply.
Assignee: nobody → bhearsum
Based on the fact that:
- Connections are consistently getting killed during idle parts of the job (eg, during linking) when there's no activity between the master and slave
- We had network maintenance over the weekend
- The first occurrence of this is on Monday

I think that something changed in the network configuration that is more aggressively killing "idle" connections. Can someone from IT look into whether or not that's plausible?
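One way to test the idle-timeout hypothesis above is to hold a TCP connection open, send nothing for a chosen period, and then check whether it still passes traffic. The following is a minimal sketch, not something that was run for this bug. It assumes a hypothetical echo-style service on the far side; the buildbot master's own port speaks a different protocol and would not answer a bare "ping". The function name and parameters are illustrative.

```python
import socket
import time


def survives_idle(host, port, idle_seconds, timeout=10.0):
    """Return True if a TCP connection still carries data after sitting idle.

    Assumes the peer echoes back whatever it receives (hypothetical service).
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Stay silent, as a slave does while the linker runs without output.
        time.sleep(idle_seconds)
        try:
            sock.sendall(b"ping\n")
            # A live echo peer answers with data. A closed connection yields
            # b"" or raises; a silently dropped one times out (an OSError).
            return sock.recv(16) != b""
        except OSError:
            return False
```

Running this from a slave against a test listener on the master's network with increasing `idle_seconds` values would give a rough idea of where an intermediate device starts dropping idle connections.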
Assignee: bhearsum → server-ops
Component: Release Engineering → Server Operations
QA Contact: jdow
Summary: Permanent Win7 failure on esr10, with "Connection to the other side was lost in a non-clean fashion" → mw32-ix-slave* connections to buildbot master are dieing during linking
Can you specify some slave/master pairs where you're seeing this?
Assignee: server-ops → network-operations
Severity: blocker → critical
Component: Server Operations → Server Operations: Netops
QA Contact: jdow → ravi
Summary: mw32-ix-slave* connections to buildbot master are dieing during linking → mw32-ix-slave* connections to buildbot master are dying during linking
One example: buildbot-master30.srv.releng.scl3.mozilla.com and mw32-ix-slave15.build.mtv1.mozilla.com
we haven't changed any service time outs. what port is this communicating on?
(In reply to casey ransom [:casey] from comment #12)
> we haven't changed any service time outs. what port is this communicating on?

The slave connects to the master's port 9001 (tcp)
The timeout should be extended again now. Traffic from build to internal hosts was moved to a different zone on fw1.mtv1. The policy applied there was an 'any' policy, which didn't include provisions for the longer timeout you require.
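The fix here was made on the firewall. As a complementary, endpoint-side illustration only, a long-lived connection can also be kept from looking idle by enabling TCP keepalive on the socket. The sketch below is an assumption-laden example and not what was changed for this bug: the helper name and the default values (600 s idle, 60 s interval, 5 probes) are hypothetical, and the idle value would need to be shorter than the firewall's session timeout, which this bug does not state.

```python
import socket


def enable_keepalive(sock, idle=600, interval=60, probes=5):
    """Enable TCP keepalive so a quiet connection still generates traffic.

    The default values are illustrative, not taken from this bug.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The finer-grained options are platform-specific, so set each one
    # only if this Python build exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of silence before the first keepalive probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between probes that go unanswered.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

Keepalive probes are sent by the operating system, so they continue even while the application itself is busy and produces no output.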
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I've kicked off some new jobs that I'll keep an eye on. These jobs take 3-4h to complete, so it'll be a while before I can confirm that this fixed things. Thank you for the very quick response, though!
I rebooted all of the affected slaves last night to make sure they didn't have any stale state. After that, all the builds I kicked worked fine. Thanks very much for the quick response here.
Status: RESOLVED → VERIFIED
Assignee: network-operations → nobody
Component: Server Operations: Netops → Release Engineering
QA Contact: ravi
Assignee: nobody → network-operations
Component: Release Engineering → Server Operations: Netops
QA Contact: ravi
just moving it around to take off the esr tracking flag.
Blocks: 781784
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard