Closed Bug 687098 Opened 13 years ago Closed 12 years ago

Android tests fail with "Timed out while waiting for server startup"

Categories

(Release Engineering :: General, defect, P3)

ARM
Android
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [mobile_unittests][android_tier_1][android_tier_∞])

+++ This bug was initially created as a clone of Bug #650535 +++

https://tbpl.mozilla.org/php/getParsedLog.php?id=6421492&tree=Mozilla-Inbound
Android Tegra 250 mozilla-inbound opt test jsreftest-1 on 2011-09-16 02:42:38 PDT 

unable to execute ADB: ensure Android SDK is installed and adb is in your $PATH
restarting as root failed
reconnecting socket
args: ['../hostutils/bin/xpcshell', '-g', '/builds/tegra-051/test/build/hostutils/xre', '-v', '170', '-f', '/builds/tegra-051/test/build/tests/reftest/reftest/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpweLkrF';const _SERVER_PORT = '30051'; const _SERVER_ADDR ='10.250.48.202';", '-f', '/builds/tegra-051/test/build/tests/reftest/server.js']
INFO | remotereftests.py | Server pid: 14329
uncaught exception: 2147746065
Timed out while waiting for server startup.
program finished with exit code 1
elapsedTime=91.871454
TinderboxPrint: jsreftest-1<br/><em class="testfail">T-FAIL</em>
Unknown Error: command finished with exit code: 1
https://tbpl.mozilla.org/php/getParsedLog.php?id=6627961&tree=Mozilla-Inbound

Starting to smell a "some recent change left us not always managing to kill the previous run" thing.
https://tbpl.mozilla.org/php/getParsedLog.php?id=6948810&tree=Mozilla-Inbound

I only looked at buildapi/recent for one of these, to see what bad company it had been keeping before me. That one had previously done a release job, which ended in purple tears. Dunno how to find out what sort of purple, though, or who did what to whom to leave it in this state.
Depends on: 690311
I got suspicious enough about the clusters of this we see to look at what the slaves had been up to, and it apparently goes like this:

1. push to try, -p a -u a
2. Linux and Android build first, you see your test fail on Linux and hit the big red button to cancel all jobs on the push
3. The 32 Android slaves that were in the middle of tests are now primed to do this on the next job they pick up, from any tree

If we can't fix the way they cancel, can we just make Android test jobs completely ignore self-serve attempts to cancel them?
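A rough sketch of the kind of pre-flight check that might catch this, assuming the leftover state from a cancelled job is a server process still holding the test server's port (the bug does not confirm that this is the mechanism). `port_is_free` is a hypothetical helper, not something that exists in remotereftests.py; the address and port in the usage comment are the ones from the log above.

```python
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if host:port can be bound, i.e. nothing else holds it.

    Hypothetical helper for illustration; not part of the real harness.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # No SO_REUSEADDR on purpose: a port held by a leftover
            # server should make this bind fail.
            sock.bind((host, port))
        except OSError:
            return False
    return True


# Possible use before launching the xpcshell httpd server:
#
#     if not port_is_free("10.250.48.202", 30051):
#         # Assumption: a previous run was not cleaned up; report that
#         # instead of waiting for the startup timeout.
#         raise RuntimeError("server port still in use by an earlier run")
```

Failing early like this would not fix the way jobs get cancelled, but it would at least turn the generic startup timeout into a message that points at the previous job.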
https://tbpl.mozilla.org/php/getParsedLog.php?id=7847885&tree=Mozilla-Beta rather quickly insisted on proving me not entirely correct, since its previous run was the 2400 second timeout kill in https://tbpl.mozilla.org/php/getParsedLog.php?id=7847311&tree=Mozilla-Inbound
Interesting that this went quiet for a little while and then came back on December 12. Did anything change in the infrastructure at that point? Or is it the case that we weren't watching the birch tree for failures, and we only saw this show back up when we started starring failures on m-c after the birch merge?

http://brasstacks.mozilla.com/orangefactor/?display=Bug&tree=mozilla-central&endday=2011-12-19&startday=2011-11-01&bugid=687098
If you look at mozilla-central as though it's where things happen, like it used to be, you will be deceived.

What happened on December 12th was that inbound was still closed from the PGO fun, but central was (mostly) open, so for a change central was where the pushes were. December 7th and 8th, where your chart is all quiet (but inbound is not), there were only four pushes a day to central. We should probably switch WOO's default view to be inbound+central.

But you need even more insight into what's happening outside the chart to interpret it, since the prime driver for this failure mode is "people cancelling try pushes while the Android tests are still running" which is a tough thing to match up with the chart. Mobile people were piling on the tree like crazy last week, were they pushing a lot of half-formed thoughts to try, and cancelling them when the first test failure showed up?
https://tbpl.mozilla.org/php/getParsedLog.php?id=8052472&tree=Mozilla-Inbound

(A cute one, because it timed out and killed itself in one hunk, and then as a result timed out waiting for server startup in the next hunk.)
(In reply to Phil Ringnalda (:philor) from comment #161)
> If you look at mozilla-central as though it's where things happen, like it
> used to be, you will be deceived.
> 
> What happened on December 12th was that inbound was still closed from the
> PGO fun, but central was (mostly) open, so for a change central was where
> the pushes were. December 7th and 8th, where your chart is all quiet (but
> inbound is not), there were only four pushes a day to central. We should
> probably switch WOO's default view to be inbound+central.
> 
> But you need even more insight into what's happening outside the chart to
> interpret it, since the prime driver for this failure mode is "people
> cancelling try pushes while the Android tests are still running" which is a
> tough thing to match up with the chart. Mobile people were piling on the
> tree like crazy last week, were they pushing a lot of half-formed thoughts
> to try, and cancelling them when the first test failure showed up?

Interesting, thanks for the background. Is there a bug open to switch the WOO to consider inbound+central? That seems like something we should do posthaste.
https://tbpl.mozilla.org/php/getParsedLog.php?id=8259820&tree=Mozilla-Inbound

Sort of makes you wonder who killed their running Android tests on try, doesn't it?
Whiteboard: [orange][mobile_unittests][android_tier_1] → [orange][mobile_unittests][android_tier_1][triagefollowup]
Assignee: nobody → coop
Whiteboard: [orange][mobile_unittests][android_tier_1][triagefollowup] → [orange][mobile_unittests][android_tier_1]
Something new is busted, because these aren't just killed jobs on try.

https://tbpl.mozilla.org/php/getParsedLog.php?id=9246122&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=9479426&tree=Mozilla-Inbound - it'd been so long, I thought something I didn't understand that landed back then had either fixed this, or managed to swallow the message that identified this.
Assignee: coop → nobody
Assignee: nobody → philringnalda
Whiteboard: [orange][mobile_unittests][android_tier_1] → [orange][mobile_unittests][android_tier_1][android_tier_∞]
Assignee: philringnalda → nobody
not seen in a long time
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Whiteboard: [orange][mobile_unittests][android_tier_1][android_tier_∞] → [mobile_unittests][android_tier_1][android_tier_∞]
Product: mozilla.org → Release Engineering