Closed Bug 560056 Opened 15 years ago Closed 14 years ago

Frequent "uncaught exception: 2147746065" and "Timed out while waiting for server startup" running mochitests on Rev3 darwin machines

Categories

(Release Engineering :: General, defect, P5)

x86_64
macOS
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [talos])

This may very well be a code problem, but nobody's likely to know _what_ problem without some releng help first. From looking back at a day or so of the mochitests I just made visible on tinderbox (3, 4, 5, opt-only, since that's what's kind of greenish): http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271418870.1271419149.1776.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-3/5 on 2010/04/16 04:54:30 s: talos-r3-snow-011 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271434132.1271434313.13855.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-2/5 on 2010/04/16 09:08:52 s: talos-r3-snow-010 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271434771.1271435643.19212.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/16 09:19:31 s: talos-r3-snow-004 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271443692.1271443839.8003.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-2/5 on 2010/04/16 11:48:12 s: talos-r3-snow-008 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271444741.1271444889.10708.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-4/5 on 2010/04/16 12:05:41 s: talos-r3-snow-014 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271453722.1271454252.6113.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitest-other on 2010/04/16 14:35:22 s: talos-r3-snow-010 In every case, the log is the same except what fraction of 92 seconds it takes to get to the timeout: python mochitest/runtests.py --appname=./Minefield.app/Contents/MacOS/firefox-bin --utility-path=bin --extra-profile-file=bin/plugins --certificate-path=certs --autorun --close-when-done --console-level=INFO --symbols-path=symbols --chrome ... INFO | runtests.py | Server pid: 521 uncaught exception: 2147746065 Timed out while waiting for server startup. program finished with exit code 1 elapsedTime=92.593459
Hi Phil, The only similarity that I have seen is that the web server gets a pid higher than 500. Every time this happens it seems to time out. Any pid less than 500 seems to have worked.
For context, I have seen these exceptions only to happen in Mo 2, 4 & other and only in this last commit it actually happened in all these 3 suites.
Logs from my just-filed duplicate bug 560292 (the logs Armen refers to in comment 3): http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271690054.1271690507.15701.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-4/5 on 2010/04/19 08:14:14 s: talos-r3-snow-015 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271690031.1271690507.15703.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-2/5 on 2010/04/19 08:13:51 s: talos-r3-snow-017 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271690039.1271690511.15709.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-3/5 on 2010/04/19 08:13:59 s: talos-r3-snow-007 (thanks for catching the dupe, philor -- I neglected to search bugzilla before filing. Bad dholbert.)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271720150.1271722506.6828.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-2/5 on 2010/04/19 16:35:50 s: talos-r3-snow-012 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271720391.1271722435.6549.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-3/5 on 2010/04/19 16:39:51 s: talos-r3-snow-003
Whiteboard: [orange] → [orange][talos]
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271795056.1271795935.26208.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-3/5 on 2010/04/20 13:24:16 s: talos-r3-snow-017
The failing pid in these logs are again higher than 500 (this doesn't mean it is the reason): INFO | runtests.py | Server pid: 4971 INFO | runtests.py | Server pid: 559 INFO | runtests.py | Server pid: 4461 INFO | runtests.py | Server pid: 849 INFO | runtests.py | Server pid: 882 INFO | runtests.py | Server pid: 576 Tests suites that have failed so far are 2, 3, 4 & other.
I've just seen what looks like the same failure on a couple of 10.5.8 mochitest runs: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271840801.1271842216.20839.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitests-2/5 on 2010/04/21 02:06:41 s: talos-r3-leopard-004 INFO | runtests.py | Server pid: 484 uncaught exception: 2147746065 Timed out while waiting for server startup. program finished with exit code 1 http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1271840825.1271842216.20840.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitests-3/5 on 2010/04/21 02:07:05 s: talos-r3-leopard-030 INFO | runtests.py | Server pid: 498 uncaught exception: 2147746065 Timed out while waiting for server startup. program finished with exit code 1
So it seems that this happens once in a while for both 10.5 and 10.6. My theory of the pid is now 480 and higher. I think this should not be too difficult to try if most of us have 10.6 machines. Does anyone have time to give it a try? For reference the code being run is: http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py.in#329 and times out waiting for an alive file to exist. os.path.exists(aliveFile): Maybe increasing the timeout will solve this.
Summary: Frequent "uncaught exception: 2147746065" and "Timed out while waiting for server startup" running OS X 10.6 mochitests → Frequent "uncaught exception: 2147746065" and "Timed out while waiting for server startup" running mochitests on Rev3 darwin machines
I'm really curious to hear a theory about why the PID could be a factor.
Assuming the OS simply increments the PIDs it hands out for new processes I can only take it as an indication that we've run more processes. So either the OS itself has run some periodic job, or the reboot after the previous job failed.
http://tinderbox.mozilla.org/Firefox/1272212417.1272212744.21340.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-2/5 on 2010/04/25 09:20:17 s: talos-r3-snow-009 INFO | runtests.py | Server pid: 417 uncaught exception: 2147746065 Timed out while waiting for server startup.
(In reply to comment #12) > uncaught exception: 2147746065 NS_ERROR_NOT_AVAILABLE, FWIW.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1272631219.1272631618.23647.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitests-3/5 on 2010/04/30 05:40:19 uncaught exception: 2147746065 Timed out while waiting for server startup. program finished with exit code 1 TinderboxPrint: mochitest-plain-3<br/><em class="testfail">T-FAIL</em>
And for the previous one: INFO | runtests.py | Server pid: 437 I'd expect the PIDs to be reasonably consistent most of the time if the machine is rebooting before each run.
http://tinderbox.mozilla.org/Firefox/1272834053.1272834532.11327.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitests-5/5 on 2010/05/02 14:00:53 s: talos-r3-leopard-029 INFO | runtests.py | Server pid: 478 uncaught exception: 2147746065 Timed out while waiting for server startup.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1272834753.1272835520.13523.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test mochitests-5/5 on 2010/05/02 14:12:33 s: talos-r3-leopard-009 INFO | runtests.py | Server pid: 600 ... WARNING: failed to bind socket: file /builds/moz2_slave/mozilla-central-macosx-debug/build/netwerk/base/src/nsServerSocket.cpp, line 317 uncaught exception: 2147746065 ... Timed out while waiting for server startup.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273064040.1273064719.12628.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitest-other on 2010/05/05 05:54:00 s: talos-r3-leopard-027 INFO | runtests.py | Server pid: 420 uncaught exception: 2147746065 Timed out while waiting for server startup.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273097792.1273098637.4556.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test mochitests-2/5 on 2010/05/05 15:16:32 s: talos-r3-leopard-039 INFO | runtests.py | Server pid: 396 WARNING: failed to bind socket: file /builds/slave/mozilla-central-macosx-debug/build/netwerk/base/src/nsServerSocket.cpp, line 317 uncaught exception: 2147746065 Timed out while waiting for server startup.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273097798.1273099494.15904.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test mochitest-other on 2010/05/05 15:16:38 s: talos-r3-leopard-004 INFO | runtests.py | Server pid: 399 WARNING: failed to bind socket: file /builds/slave/mozilla-central-macosx-debug/build/netwerk/base/src/nsServerSocket.cpp, line 317 uncaught exception: 2147746065 Timed out while waiting for server startup.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273184293.1273184612.14635.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-3/5 on 2010/05/06 15:18:13 s: talos-r3-snow-019
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273184666.1273185114.21540.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitests-4/5 on 2010/05/06 15:24:26 s: talos-r3-leopard-023
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273233295.1273233490.30569.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitests-4/5 on 2010/05/07 04:54:55 s: talos-r3-leopard-008
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273862361.1273862806.28970.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central debug test mochitests-3/5 on 2010/05/14 11:39:21 s: talos-r3-snow-020
Based on the frequency of this I'm lowering the priority.
Priority: -- → P5
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274369353.1274369894.17400.gz This has happened three times on a single check-in so far.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1286505654.1286505947.29917.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central debug test mochitests-3/5 on 2010/10/07 19:40:54 s: talos-r3-snow-017 INFO | runtests.py | Server pid: 394 WARNING: Error reading channel-prefs.js: file /builds/slave/mozilla-central-macosx64-debug/build/modules/libpref/src/nsPrefService.cpp, line 922 nsNativeModuleLoader::LoadModule("/Users/cltbld/talos-slave/mozilla-central_snowleopard-debug_test-mochitests-3/build/MinefieldDebug.app/Contents/MacOS/components/libxpcomsample.dylib") - load FAILED, rv: 80004005, error: <unknown; can't get error from NSPR> WARNING: failed to bind socket: file /builds/slave/mozilla-central-macosx64-debug/build/netwerk/base/src/nsServerSocket.cpp, line 317 uncaught exception: 2147746065 Timed out while waiting for server startup.
Can we fix xpcshell to print a stack trace when it handles an uncaught exception? That might make this a bit easier to diagnose.
At least in this most recent failure, it's xpcshell failing to bind the socket.
http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1288896208.1288896391.17021.gz Rev3 MacOSX Leopard 10.5.8 tracemonkey opt test mochitests-3/5 on 2010/11/04 11:43:28 s: talos-r3-leopard-028 INFO | runtests.py | Server pid: 506 uncaught exception: 2147746065 Timed out while waiting for server startup.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1289276044.1289276328.20053.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitests-2/5 on 2010/11/08 20:14:04 s: talos-r3-leopard-008 INFO | runtests.py | Server pid: 391 uncaught exception: 2147746065 Timed out while waiting for server startup.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1289276044.1289276986.23005.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitest-other on 2010/11/08 20:14:04 s: talos-r3-leopard-028 INFO | runtests.py | Server pid: 390 uncaught exception: 2147746065 Timed out while waiting for server startup. Particularly bad as a part of mochitest-other, since this doesn't put anything into the brief log, so with mochitest-chrome hitting this and mochitest-browser-chrome hitting a common intermittent orange, it got double-starred as just the browser-chrome failure.
I usually see this when xpcshell is already running. Could it be that we are running two test suites (mochitest-chrome, browser-chrome, etc...) on the same machine and for some reason we are not cleaning up properly before we run the second test suite?
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1296176355.1296178136.12534.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central debug test mochitests-3/5 on 2011/01/27 16:59:15 s: talos-r3-snow-018
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1297190320.1297192779.7408.gz Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test mochitest-other on 2011/02/08 10:38:40 s: talos-r3-leopard-036
Dunno about any of the other instances, but comment 42 was the result of killing a previous test run, which skips the reboot step, so httpd.js is still running from that and can't start up again.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1297194573.1297194873.16591.gz Rev3 MacOSX Snow Leopard 10.6.2 mozilla-central opt test mochitests-1/5 on 2011/02/08 11:49:33 s: talos-r3-snow-024
So, we had one thing through last November, when something got fixed (I wanted to pin it on the Bluetooth dialog, but that was the right day but in December, not November), then we had the period when we were killing test jobs and that wasn't running the reboot step, and now we don't have either one.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Whiteboard: [orange][talos] → [talos]
Product: mozilla.org → Release Engineering
Still happening with Thunderbird 38.1.0 Timestamp: Tue, August, 04, 08, 2015 3:19:03 AM Error: uncaught exception: 2147746065
(In reply to Erdin Kacan from comment #46) > Still happening with Thunderbird 38.1.0 > > Timestamp: Tue, August, 04, 08, 2015 3:19:03 AM > Error: uncaught exception: 2147746065 Erdin, is the next line of your error text the same as the reporter's? Timed out while waiting for server startup.
Flags: needinfo?(philringnalda)
QA Contact: mshal
Flags: needinfo?(philringnalda)
You need to log in before you can comment on or make changes to this bug.