Bug 915212 (Open) - Opened 6 years ago, Updated 4 months ago

Intermittent mochitest "TEST-UNEXPECTED-FAIL | runtests.py | Timed out while waiting for server startup."

Categories

(Testing :: Mochitest, defect)


Tracking

(Not tracked)

People

(Reporter: RyanVM, Unassigned)

References

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell unknown])

https://tbpl.mozilla.org/php/getParsedLog.php?id=27658162&tree=Mozilla-Inbound

Ubuntu ASAN VM 12.04 x64 mozilla-inbound opt test mochitest-4 on 2013-09-10 13:22:54 PDT for push 4f3fe05d6dc8
slave: tst-linux64-ec2-076

13:29:19     INFO - Calling ['/builds/slave/test/build/venv/bin/python', '-u', '/builds/slave/test/build/tests/mochitest/runtests.py', '--appname=/builds/slave/test/build/application/firefox/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64-asan/1378841630/firefox-26.0a1.en-US.linux-x86_64-asan.crashreporter-symbols.zip', '--certificate-path=tests/certs', '--autorun', '--close-when-done', '--console-level=INFO', '--setpref=webgl.force-enabled=true', '--total-chunks=5', '--this-chunk=4', '--chunk-by-dir=4'] with output_timeout 1000
13:29:19     INFO -  INFO | automation.py | ASan running in mid-memory configuration
13:29:20     INFO -  INFO | automation.py | ASan running in mid-memory configuration
13:29:20     INFO -  INFO | automation.py | Launching: /builds/slave/test/build/tests/bin/xpcshell -g /builds/slave/test/build/application/firefox -v 170 -f /builds/slave/test/build/tests/mochitest/httpd.js -e "const _PROFILE_PATH = '/tmp/tmpNlt3s9'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;" -f ./server.js
13:29:20     INFO -  Mochitest INFO | runtests.py | Server pid: 2607
13:30:50     INFO -  Mochitest ERROR | Timed out while waiting for server startup.
13:30:50    ERROR - Return code: 256
It's kind of hard to tell (because we drop the xpcshell output) whether this is just "xpcshell takes forever to start up on ASAN builds" or "xpcshell is erroring out in ASAN builds".
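For reference, the "waiting for server startup" step is conceptually just a poll loop like the sketch below. This is not the literal runtests.py code; the host/port come from the launch line above, and the ~90-second budget is a guess from the 13:29:20 -> 13:30:50 gap in the log.

import socket
import time

# Illustrative sketch only: poll until something accepts TCP connections
# on the server port, or give up after the timeout.
def wait_for_server(host='127.0.0.1', port=8888, timeout=90):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            socket.create_connection((host, port), timeout=5).close()
            return True  # server is up and accepting connections
        except socket.error:
            time.sleep(1)  # not up yet; retry
    return False  # -> "Timed out while waiting for server startup."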
Summary: Intermittent ASAN mochitest "Return code: 256" after "Timed out while waiting for server startup" error → Intermittent Linux mochitest "Return code: 256" after "Timed out while waiting for server startup" error
Depends on: 917817
Some of these have:
17:18:44     INFO -  WARNING: failed to bind socket: file ../../../../netwerk/base/src/nsServerSocket.cpp, line 352

...which is interesting. What's using that port? I thought we rebooted test slaves after runs. Maybe that's no longer true with EC2 slaves?
...note that nsServerSocket already uses PR_SockOpt_Reuseaddr:
http://mxr.mozilla.org/mozilla-central/source/netwerk/base/src/nsServerSocket.cpp#342

so I don't think this is just a case of something else having had that port in use recently. Something must still be actively listening on that port.
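For illustration, here's a Python analogue of that distinction (the real bind happens in NSPR/C++, but the semantics match): SO_REUSEADDR lets a bind succeed over a port lingering in TIME_WAIT, but not over one with a live listener.

import socket

# An actively listening socket on the port in question...
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(('127.0.0.1', 8888))
listener.listen(5)

# ...defeats a second bind even with SO_REUSEADDR set, because the port
# is actively held rather than merely lingering in TIME_WAIT.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
second.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
try:
    second.bind(('127.0.0.1', 8888))
except socket.error as e:
    print(e)  # EADDRINUSE: "Address already in use"
listener.close()
second.close()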
Summary: Intermittent Linux mochitest "Return code: 256" after "Timed out while waiting for server startup" error → Intermittent Linux mochitest "TEST-UNEXPECTED-FAIL | runtests.py | Timed out while waiting for server startup."
Every time I look, the run before the one that timed out waiting for server startup is one that was cancelled through self-serve.
That sounds like a smoking gun to me. That plus what I mentioned in comment 7 and comment 8 indicates that things aren't being cleaned up properly after a cancelled job.

catlee: is this a known problem? If not, we should fix that.
Flags: needinfo?(catlee)
Seeing this permaorange on tegra-057 (bug 778920)
Tentatively moving to RelEng per comment 15, comment 16.
Component: Mochitest → General Automation
Product: Testing → Release Engineering
QA Contact: catlee
Summary: Intermittent Linux mochitest "TEST-UNEXPECTED-FAIL | runtests.py | Timed out while waiting for server startup." → Intermittent mochitest "TEST-UNEXPECTED-FAIL | runtests.py | Timed out while waiting for server startup."
Blocks: 919920
So when buildbot interrupts a job, it sends SIGKILL to the whole process group. I believe jobs that normally reboot afterward should still reboot after being interrupted this way, but Callek thought perhaps this wasn't happening. Some pointers to logs of interrupted builds would be helpful.

What platforms is this happening on?

Is the test harness somehow creating a new process group for the test server so that it doesn't get the KILL signal sent to it?
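Something like this hypothetical sketch would produce exactly that escape (the sleep child stands in for the server; names are illustrative only):

import os
import subprocess

# A child started via setsid() gets its own session and process group,
# so a group-wide SIGKILL aimed at the parent's group never reaches
# it -- and any socket it holds stays open.
child = subprocess.Popen(['sleep', '300'], preexec_fn=os.setsid)
print(os.getpgid(0) != os.getpgid(child.pid))  # True: different groups
child.kill()  # clean up the demo child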

It's also possible that the socket lingers after the process shuts down. See the tail end of http://docs.python.org/2.7/library/socket.html#example. Are we setting SO_REUSEADDR on the socket? Can we choose a port number dynamically?
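A minimal sketch of the dynamic-port idea, assuming the harness could pass the chosen port through to xpcshell instead of the fixed _SERVER_PORT = '8888' seen in the log above:

import socket

# Bind to port 0 and let the kernel pick a free ephemeral port.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('127.0.0.1', 0))
port = s.getsockname()[1]  # the port the harness would hand to the server
s.close()
print(port)
# Caveat: there's a race between close() and the server rebinding the
# port; another process could grab it in between.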
Flags: needinfo?(catlee)
There are actually at least three things precipitating it, not just the interrupts I'd seen as of comment 15. For Android (the most annoying, because it typically breaks several jobs before it heals), it's the "2400 seconds without output, attempting to kill" failures. For desktop, it's both cancelled jobs and the constant stream of disconnects.

But none of those things are new: Android has been timing out without output for as long as we've run it, we've been cancelling jobs since the minute we got the opportunity, and we've had slews of disconnects since we first moved the buildbot masters into a different data center than the slaves.

So what changed for mochitest server startup on 2013-09-11 for desktop, and 2013-09-24 for mobile (and possibly, though it doesn't seem to be frequent enough to be sure of the start time, 2013-09-26 for b2g emulator)?
Flags: needinfo?(ted)
And because we can never have too many things all jammed together, comment 109 is a mochitest-other where mochitest-a11y hit an assertion, failed (without any useful message), and then mochitest-plugins timed out waiting for server startup.
Component: General Automation → Mochitest
OS: Linux → All
Product: Release Engineering → Testing
QA Contact: catlee
Hardware: x86_64 → All
Version: unspecified → Trunk
(In reply to Chris AtLee [:catlee] from comment #106)
> It's also possible that the socket lingers after the process shuts down. See
> the tail end of http://docs.python.org/2.7/library/socket.html#example. Are
> we setting SO_REUSEADDR on the socket? Can we choose a port number
> dynamically?

Per comment 8 we already use SO_REUSEADDR. If we're rebooting between jobs then I wouldn't expect this to matter, since everything should be wiped out.

(In reply to Phil Ringnalda (:philor) from comment #108)
> So what changed for mochitest server startup on 2013-09-11 for desktop, and
> 2013-09-24 for mobile (and possibly, though it doesn't seem to be frequent
> enough to be sure of the start time, 2013-09-26 for b2g emulator)?

Nothing in the harness changed around 9/11:
http://hg.mozilla.org/mozilla-central/filelog/3ad5ecf34c9e/testing/mochitest/runtests.py
http://hg.mozilla.org/mozilla-central/filelog/cc4a3f3f899e/build/automation.py.in

That being said, if it's actually a product/harness bug it could be a change in any of a myriad of places; I wouldn't know where to start trying to narrow that down.
Flags: needinfo?(ted)
So, given the correlation of this happening after canceling a job, there is a possible issue with Android here. I believe (and Callek/Kmoir can correct me if this has changed) that the HTTP server that mochitest uses on Android is assigned a port based on which tegra/panda the buildbot slave process is using. So, if buildbot slave X is using tegra Y, then it uses port nnnnn on the foopy where xpcshell is running httpd.js for mochitest.

If we cancel a job, does that process on the foopy get axed and is that port freed up?

That of course does nothing to explain why we also see this on desktop tests. :-(

Needinfo'ing Callek to see if he can confirm whether canceling the job ensures the xpcshell process on the foopy is terminated.
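For anyone wanting to check by hand, here's a rough diagnostic one could run on a foopy, assuming lsof is installed there (ports per the device mapping Callek quotes below):

import subprocess

def who_listens(port):
    # lsof exits non-zero when nothing matches the filter.
    try:
        return subprocess.check_output(
            ['lsof', '-nP', '-i', 'TCP:%d' % port]).decode()
    except subprocess.CalledProcessError:
        return ''

# e.g. tegra-057's http/ssl ports per the mapping quoted below
for port in (30057, 31057):
    print('%d: %s' % (port, who_listens(port).strip() or 'free'))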
Flags: needinfo?(bugspam.Callek)
(In reply to Clint Talbert ( :ctalbert ) from comment #229)
> If we cancel a job, does that process on the foopy get axed and is that port
> freed up?

In most cases buildbot will auto-kill spawned subprocs.

In some rare cases buildbot will miss a proc so we have code that does a cleanup audit of procs from older runs.

In even rarer cases [for now] we have a bug that fails to recognize some of this, but I have a patch in my review queue (as of today) that should fix that.
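For flavor, a heavily simplified sketch of what such an audit amounts to (this is not the real cleanup code, and identifying stale servers purely by process name is an assumption):

import os
import signal
import subprocess

# Hypothetical audit: find leftover xpcshell processes from older runs
# and SIGKILL them so the ports they hold are freed for the next job.
try:
    out = subprocess.check_output(['pgrep', '-f', 'xpcshell'])
except subprocess.CalledProcessError:
    out = b''  # pgrep exits 1 when nothing matches
for pid in (int(p) for p in out.split()):
    os.kill(pid, signal.SIGKILL)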

There is also the port thing as you describe; specifically, see the mapping in Python:

'tegra_android': dict([('tegra-%03i' % x, {'http_port': '30%03i' % x, 'ssl_port': '31%03i' % x}) \
   ....

and 
'panda_android': dict(
        [('panda-%04i' % x, {'http_port': '30%03i' % x, 'ssl_port': '31%03i' % x}) \
   ...

at http://mxr.mozilla.org/build/source/buildbot-configs/mozilla-tests/production_config.py#19
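Reconstructed as runnable Python with a made-up device range (the real range is elided above and lives in production_config.py):

# Hypothetical device range purely for illustration; the '....' above
# elides the real one.
tegra_android = dict(
    ('tegra-%03i' % x, {'http_port': '30%03i' % x, 'ssl_port': '31%03i' % x})
    for x in range(1, 101)
)
print(tegra_android['tegra-057'])
# {'http_port': '30057', 'ssl_port': '31057'} -- each device always gets
# the same fixed ports, so a server leaked by a cancelled job blocks
# every subsequent job on that device.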
Flags: needinfo?(bugspam.Callek)