Bug 650535 (Closed) — Opened 9 years ago, Closed 9 years ago

Android tests fail with "Timed out while waiting for server startup"

Categories

(Release Engineering :: General, defect, P3)

ARM
Android
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: bear)

References

Details

(Keywords: intermittent-failure, Whiteboard: [mobile_unittests][android_tier_1])

Attachments

(1 file)

The pattern, for those needing to star something to be able to land or to stop someone yelling at you about your orange, is:

You have a log like http://tinderbox.mozilla.org/showlog.cgi?log=Mozilla-Aurora/1302957935.1302959478.28879.gz where a mochitest step died with

INFO | runtests.py | Server pid: 27278
uncaught exception: 2147746065
Timed out while waiting for server startup.

which tends to mean "the server returned NS_ERROR_NOT_AVAILABLE because it's still running from something that died before." So you look at https://build.mozilla.org/buildapi/recent/tegra-nnn (with the nnn of your slave), look back through recent failures for an orange result==1 on a crashtest with no intervening mochitests between it and your run, use the name of the tree and the start time to figure out which run that would have been, and find a log like http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1302953854.1302957705.20676.gz where the crashtest run ended with

REFTEST TEST-START | Shutdown

INFO | automation.py | Application ran for: 0:08:32.671885
INFO | automation.py | Reading PID log: /tmp/tmp1mtwpHpidlog
WARNING | automationutils.processLeakLog() | refcount logging is off, so leaks can't be detected!

REFTEST INFO | runreftest.py | Running tests: end.

command timed out: 2400 seconds without output, killing pid 26215
process killed by signal 9
program finished with exit code -1

which seems to leave the server running, ready to blow up the next job.
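A quick way to confirm this failure mode before starting a new run is to check whether anything is still listening on the tegra's httpd.js port. This is a hypothetical helper sketch, not part of runtests.py; the port-per-tegra convention is described later in this bug:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something (e.g. a zombie httpd.js server left
    over from a killed test run) is already bound to the port."""
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind((host, port))
        s.close()
        return False  # we could bind, so the port was free
    except socket.error:
        return True   # bind failed: a leftover server likely owns it
```

If this returns True at the start of a run, the "Timed out while waiting for server startup" outcome is essentially guaranteed.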
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1302950674.1302952005.14010.gz is cute, since it's actually "crashtest killed by crashtest dying."
And http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303005755.1303006681.18626.gz as the other shoe dropping when 066 finally got another mochitest run to blow up.
It would have been more accurate if I had filed "go ahead and write the patch to check for leftover stuff after anything bad happens, like you already know you need to do," since http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303010194.1303011036.1211.gz seems to be a mochitest dying after http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303005755.1303007172.20105.gz, which was a tsvg hitting bug 650018 after a pair of hg failures.
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303012774.1303013872.10694.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303012775.1303013570.9762.gz

Despite having what I thought was a world class appetite for spamming bugs, I don't think I can actually stand to copy-paste every instance of this, if it's really two or three or four every single m-c push.
Yeah, I think killing zombie daemons is the right long term approach.
From a single push (it caught the nightly, which for some reason, maybe because it was the m-c nightly, triggered three sets of tests): three mochitests, a crashtest, and a reftest.

http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303040375.1303041088.26418.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303041411.1303042486.3338.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303032814.1303033512.15925.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303032814.1303033837.16888.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1303040374.1303041214.27341.gz

If killing zombies is the right long term approach, what's the right short term approach? This absolutely blocks showing mobile results on the mozilla-central tree.
Whiteboard: [orange] → [orange][mobile_unittests]
(In reply to comment #8)
> If killing zombies is the right long term approach, what's the right short term
> approach? This absolutely blocks showing mobile results on the mozilla-central
> tree.

Most likely hiding the broken unit test results is the correct short term approach, until we can fix them.
We need to figure out a decent solution for this sooner rather than later. The problem is that when a test leaves a zombie xpcshell process running, the next test run on that machine will fail to start up. This is partly because we use the same xpcshell httpd.js port for each tegra (based on the tegra number). FWIW, I think that is a good approach to keep it simple.

A couple possible solutions:
1) use random dynamic ports, not based on tegra name/number
2) in the install/config scripts, search for an xpcshell process with the given port for the tegra.  
3) keep track of a PID for the xpcshell process when launching it and in the cleanup step ensure that pid is terminated.
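Solution 3 could look roughly like the sketch below. launch_with_pidfile and cleanup_pidfile are hypothetical names, and a real harness would launch the actual xpcshell command line rather than an arbitrary one:

```python
import os
import signal
import subprocess

def launch_with_pidfile(cmd, pidfile):
    """Launch a server process and record its PID so a later
    cleanup step can terminate any leftover zombie."""
    proc = subprocess.Popen(cmd)
    with open(pidfile, "w") as f:
        f.write(str(proc.pid))
    return proc

def cleanup_pidfile(pidfile):
    """Kill the process named in the pidfile if it is still alive,
    then remove the pidfile. Returns True if a zombie was killed."""
    if not os.path.exists(pidfile):
        return False
    with open(pidfile) as f:
        pid = int(f.read().strip())
    killed = False
    try:
        os.kill(pid, signal.SIGKILL)  # a gentler SIGTERM-first pass is possible
        killed = True
    except OSError:
        pass  # process already gone
    os.remove(pidfile)
    return killed
```

Running cleanup_pidfile at the start of every job (not just at the end) would cover the case where the previous job was killed outright and never reached its own cleanup step.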

I am not sure why we turned on the mobile unittests for m-c without seeing them run for a week or two without problems on the mobile tree...that is just my opinion.
We didn't put them on mozilla-central (though sort of close, M1-4 and crashtest on mozilla-aurora) - it just seems like it because I'm attacking them on mobile as though people have been threatening to put them on mozilla-central, because people have been threatening to put them on mozilla-central. They got a lot more visibility the last couple of weeks since people pushed things that were totally green on try and totally green on cedar to mozilla-central, and then got yelled at and backed out because they broke startup on mobile.
This should potentially be fixed by bug 650828 and bug 650865 (remote unit test pid files + usage of such).
Priority: -- → P3
I hid the mochitests and crashtests (reftests will be permaorange anyway until the next time we merge) on aurora, so someone help me remember to unhide them when 650865 cuts the failure rate down.
This should improve now that bug 650865 has landed (though only new daemons will get pidfiles, so may take a while to shake out old zombie daemons).
Did this maybe get broken by bug 652032, or something else that happened Friday afternoon? Things were looking really good for a while, but now we're back to just as many "Timed out while waiting" failures as ever.
Priority: P3 → --
Screw you and your failure to detect midairs and your random changes, Bugzilla.
Priority: -- → P3
It's certainly possible.
I updated sut_tools, which grabbed a new reboot.py.
I also updated talos, which grabbed a new devicemanager that conflicted with the existing [patched] devicemanager.
Any recurrence in the past month?
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306481774.1306482622.21118.gz (and two others so far today) looks characteristic, a "Timed out while waiting for server startup" following a "uncaught exception: 2147746065", though I find it a little suspicious that all three I saw were in just the content/smil/test hunk of mochitest-1.
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306785048.1306786606.23050.gz (tegra-074)
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306775925.1306781372.1642.gz (tegra-074)
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306759163.1306759877.31750.gz (tegra-050)
http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306753846.1306755058.4723.gz (tegra-060)

Maybe it's just a result of the number of runs there, but it feels like when I look at /recent/ for one of these (because I'm seeing way more timed out waiting for startup than I'm seeing crashtest hangs to produce them), there's a try run that blew up preceding the timeout. Is try maybe not doing the same things to clean up after itself?
(In reply to comment #21)
> Maybe it's just a result of the number of runs there, but it feels like when
> I look at /recent/ for one of these (because I'm seeing way more timed out
> waiting for startup than I'm seeing crashtest hangs to produce them),
> there's a try run that blew up preceeding the timeout. Is try maybe not
> doing the same things to clean up after itself?

I think try is blowing up more than other branches, as it should be. We may need to beef up our cleanup algorithms, as well as potentially split the pool to create dedicated try tegras.
(In reply to comment #22)
> (In reply to comment #21)
> > Maybe it's just a result of the number of runs there, but it feels like when
> > I look at /recent/ for one of these (because I'm seeing way more timed out
> > waiting for startup than I'm seeing crashtest hangs to produce them),
> > there's a try run that blew up preceeding the timeout. Is try maybe not
> > doing the same things to clean up after itself?
> 
> I think try is blowing up more than other branches, as it should be. 


> We may
> need to beef up our cleanup algorithms, as well as potentially split the
> pool to create dedicated try tegras.

For desktop, we use one shared pool of machines for both production testing and try testing, which gets us better capacity scaling. I'd expect to be doing the same shared-pool for tegras. 

(In reply to comment #21)
> Is try maybe not doing the same things to clean up after itself?
Excellent question.
Taking this bug, as I know I can detect the pattern by checking the buildslave log and adjusting/adding some code to "check.py -c"
Assignee: nobody → bear
Whiteboard: [orange][mobile_unittests] → [orange][mobile_unittests][android_tier_1]
Duplicate of this bug: 660729
(In reply to Mike Taylor [:bear] from comment #32)
> taking this bug as I know I can detect the pattern by checking the
> buildslave log and adjusting/adding some code to "check.py -c"

No update here in over a month - any progress on this android_tier_1 bug?
https://tbpl.mozilla.org/php/getParsedLog.php?id=6369488
https://tbpl.mozilla.org/php/getParsedLog.php?id=6369546
https://tbpl.mozilla.org/php/getParsedLog.php?id=6369547
https://tbpl.mozilla.org/php/getParsedLog.php?id=6369491

Seems a bit odd that this failure mode, which is pretty rare now on mozilla-inbound and mozilla-central, was four of the five failures on a tryserver run someone had me look at.
Apparently just coincidence that I wasn't seeing it off try for a while. And while the summary was accurate once upon a time, I doubt we're even running, much less timing out, enough crashtests to account for all of these.

https://tbpl.mozilla.org/php/getParsedLog.php?id=6375306
https://tbpl.mozilla.org/php/getParsedLog.php?id=6375311
https://tbpl.mozilla.org/php/getParsedLog.php?id=6375151
https://tbpl.mozilla.org/php/getParsedLog.php?id=6368881
Summary: Android mochitests fail with "Timed out while waiting for server startup" after the slave previously timed out during crashtest shutdown → Android tests fail with "Timed out while waiting for server startup"
In remote tegra testing there is an increasingly common exception that we throw, "unable to launch process". This is not caught, and we don't return an error code from the test. Consequently we never clean up the web server.
Attachment #560199 - Flags: review?(jgriffin)
Comment on attachment 560199 [details] [diff] [review]
add exception handling to runapp so we can exit cleanly on tegra mochitest (1.0)

Will this cause the color of the failures of this class to change from purple to orange on TBPL, and if so, does it matter?  If it does, you could call a new function in the except clause, something like shutDownServers, and then re-raise the exception.
Attachment #560199 - Flags: review?(jgriffin) → review+
Color is pretty much meaningless: I'd love to be wrong, but I don't think that |buildduty or |ateamduty trawls through every tree looking at every purple, and orange/red have long since lost all distinction. "Purple" is a failure that's purple, orange, or red, with a bug filed in releng or Testing. ("Purple" is also a couple of test failures that wind up mistakenly purple, or a no-output test timeout on WinXP where we lost the ability to kill the process a few weeks ago, but whatever.)
Yes, these will change from purple to orange, but all cleanup will happen, hopefully leaving the host and tegra in a clean state for future runs.

As :philor mentions we are sort of in a blur of color.  According to bear, we need to have return codes in all situations, this was not happening, nor were we seeing the exception in the public facing logs :(

:jgriffin, I don't mind implementing your alternative method to clean up and rethrow, but I would request we keep it this way so the buildbot code can handle the error more gracefully.
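The approach being discussed (catch the exception, clean up, and return a real exit code rather than re-raising) can be sketched like this; run_tests, run_app, and shut_down_servers are hypothetical stand-ins for the real harness code, not the actual patch:

```python
def run_tests(run_app, shut_down_servers):
    """Sketch: catch any exception from launching the app (e.g.
    "unable to launch process"), shut down the web server so the
    next job doesn't inherit a zombie, and return a nonzero exit
    code that buildbot can handle (orange) instead of letting the
    exception propagate (purple)."""
    try:
        return run_app()
    except Exception as e:
        print("ERROR | exception during test run: %s" % e)
        shut_down_servers()  # jgriffin's alternative would re-raise here
        return 1
```

Returning an error code keeps the failure visible to buildbot's normal result handling, which is the rationale given above for preferring it over re-raising.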
Just realized jgriffin was not cc'd on this bug, adding him so he can reply to comment 58 above.
(In reply to Joel Maher (:jmaher) from comment #58)
> yes, these will change from purple to orange, but all cleanup will happen
> and hopefully cleaning up the state of the host and tegra for future runs.
> 
> As :philor mentions we are sort of in a blur of color.  According to bear,
> we need to have return codes in all situations, this was not happening, nor
> were we seeing the exception in the public facing logs :(
> 
> :jgriffin, I don't mind implementing your alternative method to cleanup and
> rethrow, but I would request we keep it this way so the builtbot code can
> handle the error more gracefully.

If this helps buildbot handle the error gracefully, that's great.  I just wasn't sure whether or not keeping exceptions in the test infrastructure a certain color was important or not.
Whiteboard: [orange][mobile_unittests][android_tier_1] → [orange][mobile_unittests][android_tier_1][inbound]
https://tbpl.mozilla.org/php/getParsedLog.php?id=6405093&tree=Mozilla-Inbound (on the push itself, so it didn't yet get to benefit from it)
https://hg.mozilla.org/mozilla-central/rev/76f8759a4bf8
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [orange][mobile_unittests][android_tier_1][inbound] → [orange][mobile_unittests][android_tier_1]
Depends on: 687098
Whiteboard: [orange][mobile_unittests][android_tier_1] → [mobile_unittests][android_tier_1]
Product: mozilla.org → Release Engineering