gaia-ui-test on Cedar in an infinite retry from "Bad return status from http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0133/request/: 409!"

RESOLVED FIXED

Status

Release Engineering
General
P1
normal
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: philor, Assigned: armenzg)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

5 years ago
The first few failures in https://tbpl.mozilla.org/?tree=Cedar&rev=db917d782295&jobname=gaia-ui-test look like they were mid-deploy of something, between 15:22 and 15:27, with various failures to connect and 502s, then it has settled into being deeply unhappy about getting a 409:

https://tbpl.mozilla.org/php/getParsedLog.php?id=18381488&tree=Cedar

21:39:42     INFO - Request POST http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0146/request/...
21:39:43    ERROR - Bad return status from http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0146/request/: 409!
21:39:43  WARNING - Device unavailable.  Retry#1..
21:40:43     INFO - Request POST http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0146/request/...
21:40:43    ERROR - Bad return status from http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0146/request/: 409!
21:40:43  WARNING - Device unavailable.  Retry#2..
21:41:43     INFO - Request POST http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0146/request/...
21:41:44    ERROR - Bad return status from http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0146/request/: 409!
21:41:44  WARNING - Device unavailable.  Retry#3..
21:42:44     INFO - Request POST http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0146/request/...
21:42:44    ERROR - Bad return status from http://mobile-imaging-001.p1.releng.scl1.mozilla.com/api/device/panda-0146/request/: 502!
21:42:44  WARNING - # TBPL RETRY #
21:42:44    FATAL - We could not request the device: mozpool status not ok, code 502
21:42:44    FATAL - Exiting -1

Amusing though infinite loops are, I killed the 90th attempt, but if the same thing is going to happen to the Android 4.0 tests that run on Pandas on mozilla-central (none have run since the apparent start of the bustage, between 15:07 and 15:22), we're going to rather quickly get every Panda tied up once people get over their hangovers and start pushing there.
Here's the mozpool log for 21:41:44:

request.1218 INFO - [2012-12-31 21:41:43,727] entering state finding_device
request.1218 INFO - [2012-12-31 21:41:43,747] Finding device.
request.1218 INFO - [2012-12-31 21:41:43,786] Assigning device panda-0146.
request.1218 INFO - [2012-12-31 21:41:43,809] Request succeeded.
request.1218 INFO - [2012-12-31 21:41:43,819] entering state contacting_lifeguard
request.1218 ERROR - [2012-12-31 21:41:43,879] Assigned device panda-0146 is in unexpected state failed_b2g_downloading when about to contact lifeguard.
request.1218 INFO - [2012-12-31 21:41:43,889] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-31 21:41:43,927] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0146 WARNING - [2012-12-31 21:41:43,952] ignored event free in state failed_b2g_downloading
127.0.0.1:60814 - - [31/Dec/2012 21:41:43] "HTTP/1.1 POST /api/device/panda-0146/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-31 21:41:43,971] "POST /api/device/panda-0146/event/free/ HTTP/1.1" 200 2
request.1218 INFO - [2012-12-31 21:41:43,972] entering state closed
127.0.0.1:60812 - - [31/Dec/2012 21:41:44] "HTTP/1.1 POST /api/device/panda-0146/request/" - 409 Conflict

and for 21:42:44, although notice the time shift:

request.1219 INFO - [2012-12-31 21:42:49,527] entering state finding_device
request.1219 INFO - [2012-12-31 21:42:49,546] Finding device.
request.1219 INFO - [2012-12-31 21:42:49,585] Assigning device panda-0146.
request.1219 INFO - [2012-12-31 21:42:49,607] Request succeeded.
request.1219 INFO - [2012-12-31 21:42:49,617] entering state contacting_lifeguard
request.1219 ERROR - [2012-12-31 21:42:49,675] Assigned device panda-0146 is in unexpected state failed_b2g_downloading when about to contact lifeguard.
request.1219 INFO - [2012-12-31 21:42:49,685] entering state device_busy
requests.packages.urllib3.connectionpool INFO - [2012-12-31 21:42:49,720] Starting new HTTP connection (1): mobile-imaging-001.p1.releng.scl1.mozilla.com
device.panda-0146 WARNING - [2012-12-31 21:42:49,745] ignored event free in state failed_b2g_downloading
127.0.0.1:60818 - - [31/Dec/2012 21:42:49] "HTTP/1.1 POST /api/device/panda-0146/event/free/" - 200 OK
requests.packages.urllib3.connectionpool DEBUG - [2012-12-31 21:42:49,762] "POST /api/device/panda-0146/event/free/ HTTP/1.1" 200 2
request.1219 INFO - [2012-12-31 21:42:49,763] entering state closed
127.0.0.1:60816 - - [31/Dec/2012 21:42:49] "HTTP/1.1 POST /api/device/panda-0146/request/" - 409 Conflict

The 409 is occurring because that device is already in use for some other purpose.  Maybe another foopy is talking to it?

I suspect that the 502 is a timeout of the request operation.  I think Apache is limiting the number of backend connections, and they're queueing up.  I'll see if I can fix that.
Bug 825766 for the 502s, but note again that's not the core problem here.

Looking at the UI, that device is in the failed_b2g_downloading state, from back on the 17th:

2012-12-17T08:25:42 statemachine entering state b2g_downloading
2012-12-17T08:25:42 syslog getting B2G_URL from http://10.12.128.33/api/device/panda-0146/bootconfig/
2012-12-17T08:25:43 syslog B2G URL: http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/cedar-panda/1355755042
2012-12-17T08:25:43 syslog fetching http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/cedar-panda/1355755042/system.tar.bz2
2012-12-17T08:25:43 syslog wget failed:
2012-12-17T08:28:48 statemachine entering state failed_b2g_downloading
2012-12-17T08:28:48 statemachine device has failed: failed_b2g_downloading

It looks like that URL simply didn't exist.  I'll run an Android install on that device to test it.  If it's successful, it will go to the "free" state and things will start working again.  If there are other devices with this problem, do the same ("please_image with image android" in the lifeguard UI).

The Buildbot harness should probably be monitoring the device state and shutting down the buildslave instance when the device has failed.  I'll leave this bug open for that purpose.

The mozpool-based nagios monitoring will help us catch these things more quickly - bug 815765.
(Assignee)

Updated

5 years ago
Assignee: nobody → armenzg
(Assignee)

Updated

5 years ago
Priority: -- → P2
(Reporter)

Comment 3

5 years ago
Score on Cedar for today: 158 runs of gaia-ui-test on 4 pushes, 5 successful runs. Hope we've got lots of pandas.
(Assignee)

Updated

5 years ago
Priority: P2 → P1
(Assignee)

Comment 4

5 years ago
The recent large amount was a fallout from bug 826694.
I just got a green run.
(Assignee)

Comment 5

5 years ago
The issue is not happening anymore.
A looping issue like this is either caused by problems with mozpool.

In the future for handling issues like this please follow this documentation:
https://wiki.mozilla.org/ReleaseEngineering:Buildduty#Pandas
(Assignee)

Updated

5 years ago
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.