Android tests fail with "Timed out while waiting for server startup"

RESOLVED WORKSFORME

Status

defect
P3
normal
RESOLVED WORKSFORME
8 years ago
6 years ago

People

(Reporter: philor, Unassigned)

Tracking

({intermittent-failure})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [mobile_unittests][android_tier_1][android_tier_∞])

(Reporter)

Description

8 years ago
+++ This bug was initially created as a clone of Bug #650535 +++

https://tbpl.mozilla.org/php/getParsedLog.php?id=6421492&tree=Mozilla-Inbound
Android Tegra 250 mozilla-inbound opt test jsreftest-1 on 2011-09-16 02:42:38 PDT 

unable to execute ADB: ensure Android SDK is installed and adb is in your $PATH
restarting as root failed
reconnecting socket
args: ['../hostutils/bin/xpcshell', '-g', '/builds/tegra-051/test/build/hostutils/xre', '-v', '170', '-f', '/builds/tegra-051/test/build/tests/reftest/reftest/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpweLkrF';const _SERVER_PORT = '30051'; const _SERVER_ADDR ='10.250.48.202';", '-f', '/builds/tegra-051/test/build/tests/reftest/server.js']
INFO | remotereftests.py | Server pid: 14329
uncaught exception: 2147746065
Timed out while waiting for server startup.
program finished with exit code 1
elapsedTime=91.871454
TinderboxPrint: jsreftest-1<br/><em class="testfail">T-FAIL</em>
Unknown Error: command finished with exit code: 1
(Reporter)

Comment 9

8 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6627961&tree=Mozilla-Inbound

Starting to smell a "some recent change left us not always managing to kill the previous run" thing.
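
If a leftover server from the previous run is still holding the port, a pre-flight check would at least make that visible before the new server is launched. The following is only a minimal sketch under that assumption: `port_in_use` and `wait_for_server` are hypothetical helper names, not functions from remotereftests.py, and the timeout values are illustrative.

```python
import socket
import time


def port_in_use(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, unreachable, or timed out: nothing usable there.
        return False


def wait_for_server(host, port, timeout=90.0, interval=0.5):
    """Poll until host:port accepts connections; False if the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if port_in_use(host, port):
            return True
        time.sleep(interval)
    return False
```

Calling `port_in_use(addr, port)` before starting the server would separate "a stale process from the last job still owns the port" from "the new server never came up", which the single timeout message in the log above cannot do.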
(Reporter)

Comment 40

8 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6948810&tree=Mozilla-Inbound

I only looked at buildapi/recent for one of these, to see what bad company it had been keeping before me, but that one had previously done a release job, which ended in purple tears. Dunno how to find out what sort of purple, though, or who did what to whom to leave it in this state.

Updated

8 years ago
Depends on: 690311
(Reporter)

Comment 143

8 years ago
I got suspicious enough about the clusters of this we see to look at what the slaves had been up to, and it apparently goes like this:

1. push to try, -p a -u a
2. Linux and Android build first, you see your test fail on Linux and hit the big red button to cancel all jobs on the push
3. The 32 Android slaves that were in the middle of tests are now primed to do this on the next job they pick up, from any tree

If we can't fix the way they cancel, can we just make Android test jobs completely ignore self-serve attempts to cancel them?
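
One way the cancel path could clean up after itself is sketched below. It is a hedged illustration, not the actual harness or buildbot code: `run_server_job` is a hypothetical name, and it assumes the cancel reaches the harness process as an exception (for example a KeyboardInterrupt). A hard kill of the harness itself would skip the `finally` block, so this alone would not cover every case.

```python
import subprocess


def run_server_job(server_cmd, job):
    """Start the helper server, run job(server), and always stop the server."""
    server = subprocess.Popen(server_cmd)
    try:
        return job(server)
    finally:
        # Runs on normal exit and when job() raises, so the server does not
        # outlive the job and hold its port for the next one.
        server.terminate()
        try:
            server.wait(timeout=10)
        except subprocess.TimeoutExpired:
            # The server ignored the polite request; force it down.
            server.kill()
            server.wait()
```

The design choice here is to tie the server's lifetime to the job rather than to rely on the next job to notice and kill a leftover process.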
(Reporter)

Comment 144

8 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=7847885&tree=Mozilla-Beta rather quickly insisted on proving me not entirely correct, since its previous run was the 2400 second timeout kill in https://tbpl.mozilla.org/php/getParsedLog.php?id=7847311&tree=Mozilla-Inbound
Duplicate of this bug: 710280

Comment 160

7 years ago
Interesting that this went quiet for a little while and then came back on December 12. Did anything change in the infrastructure at that point? Or is it the case that we weren't watching the birch tree for failures, and only saw this show up again when we started starring failures on m-c after the birch merge?

http://brasstacks.mozilla.com/orangefactor/?display=Bug&tree=mozilla-central&endday=2011-12-19&startday=2011-11-01&bugid=687098
(Reporter)

Comment 161

7 years ago
If you look at mozilla-central as though it's where things happen, like it used to be, you will be deceived.

What happened on December 12th was that inbound was still closed from the PGO fun, but central was (mostly) open, so for a change central was where the pushes were. December 7th and 8th, where your chart is all quiet (but inbound is not), there were only four pushes a day to central. We should probably switch WOO's default view to be inbound+central.

But you need even more insight into what's happening outside the chart to interpret it, since the prime driver for this failure mode is "people cancelling try pushes while the Android tests are still running" which is a tough thing to match up with the chart. Mobile people were piling on the tree like crazy last week, were they pushing a lot of half-formed thoughts to try, and cancelling them when the first test failure showed up?
(Reporter)

Comment 166

7 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=8052472&tree=Mozilla-Inbound

(A cute one, because it timed out and killed itself in one hunk, and then as a result timed out waiting for server startup in the next hunk.)

Comment 167

7 years ago
(In reply to Phil Ringnalda (:philor) from comment #161)
> If you look at mozilla-central as though it's where things happen, like it
> used to be, you will be deceived.
> 
> What happened on December 12th was that inbound was still closed from the
> PGO fun, but central was (mostly) open, so for a change central was where
> the pushes were. December 7th and 8th, where your chart is all quiet (but
> inbound is not), there were only four pushes a day to central. We should
> probably switch WOO's default view to be inbound+central.
> 
> But you need even more insight into what's happening outside the chart to
> interpret it, since the prime driver for this failure mode is "people
> cancelling try pushes while the Android tests are still running" which is a
> tough thing to match up with the chart. Mobile people were piling on the
> tree like crazy last week, were they pushing a lot of half-formed thoughts
> to try, and cancelling them when the first test failure showed up?

Interesting, thanks for the background. Is there a bug open to switch WOO to consider inbound+central? That seems like something we should do posthaste.
(Reporter)

Comment 173

7 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=8259820&tree=Mozilla-Inbound

Sort of makes you wonder who killed their running Android tests on try, doesn't it?
Whiteboard: [orange][mobile_unittests][android_tier_1] → [orange][mobile_unittests][android_tier_1][triagefollowup]
Assignee: nobody → coop
Whiteboard: [orange][mobile_unittests][android_tier_1][triagefollowup] → [orange][mobile_unittests][android_tier_1]
(Reporter)

Comment 265

7 years ago
Something new is busted, because these aren't just killed jobs on try.

https://tbpl.mozilla.org/php/getParsedLog.php?id=9246122&tree=Mozilla-Inbound
(Reporter)

Comment 323

7 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=9479426&tree=Mozilla-Inbound - it'd been so long, I thought something I didn't understand that landed back then had either fixed this, or managed to swallow the message that identified this.
Assignee: coop → nobody
(Reporter)

Updated

7 years ago
Assignee: nobody → philringnalda
Whiteboard: [orange][mobile_unittests][android_tier_1] → [orange][mobile_unittests][android_tier_1][android_tier_∞]
(Reporter)

Updated

7 years ago
Assignee: philringnalda → nobody
not seen in a long time
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → WORKSFORME
Whiteboard: [orange][mobile_unittests][android_tier_1][android_tier_∞] → [mobile_unittests][android_tier_1][android_tier_∞]
Product: mozilla.org → Release Engineering