Bug 900873 (panda-0044)

panda-0044 problem tracking

RESOLVED FIXED

Status

Priority: P3
Severity: normal
Status: RESOLVED FIXED
Reported: 5 years ago
Modified: 5 months ago

People

(Reporter: emorley, Unassigned)

Details

(Whiteboard: [buildduty][buildslaves][capacity])

(Reporter)

Description

5 years ago
Failing over 50% of the jobs it's done:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0044

e.g.:
https://tbpl.mozilla.org/php/getParsedLog.php?id=26061280&tree=Try#error0

{
00:26:26     INFO -  08/02/2013 00:26:26: INFO: Uninstalling org.mozilla.fennec...
00:26:28     INFO -  08/02/2013 00:26:28: WARNING: Automation Error: Unable to reboot panda-0044 via Relay Board.
00:26:28     INFO -  08/02/2013 00:26:28: WARNING: Automation Error: Unable to reboot panda-0044 via Relay Board.
00:26:28     INFO -  08/02/2013 00:26:28: WARNING: Automation Error: Unable to reboot panda-0044 via Relay Board.
00:26:28     INFO -  08/02/2013 00:26:28: WARNING: Automation Error: Unable to reboot panda-0044 via Relay Board.
00:26:28     INFO -  08/02/2013 00:26:28: WARNING: Automation Error: Unable to reboot panda-0044 via Relay Board.
00:26:28     INFO -  08/02/2013 00:26:28: INFO: verifyDevice: failing to cleanup device
00:26:28     INFO -  reconnecting socket
00:26:28     INFO -  removing file: /mnt/sdcard/writetest
00:26:28     INFO -  reconnecting socket
00:26:28     INFO -  reconnecting socket
00:26:28     INFO -  reconnecting socket
00:26:28     INFO -  reconnecting socket
00:26:28    ERROR - Return code: 1
00:26:28 CRITICAL - Preparing to abort run due to failed verify check.
00:26:28     INFO - Request 'http://mobile-imaging-010.p10.releng.scl1.mozilla.com/api/request/284670/' deleted on cleanup
00:26:28    FATAL - Dieing due to failing verification
00:26:28    FATAL - Exiting -1
00:26:28     INFO - Running post-action listener: _resource_record_post_action
}

Please disable.
Flags: needinfo?(bugspam.Callek)
Forced state "disabled" via lifeguard
Flags: needinfo?(bugspam.Callek)
I might be reading this wrong, but it looks like the relay board is being contacted directly and it's making a request to mozpool.  Those two don't mix.

Callek, am I reading that right?
Flags: needinfo?(bugspam.Callek)
For the record, changing the state in mozpool is absolutely, unequivocally, 100% the wrong thing to do here.

IRC conversation suggests that there's no way in the releng automation to disable a particular panda.  If that's true, then that needs to be fixed quickly, and in the interim, manually killing clientproxy/buildslave processes is a better solution.
panda-0044 is not listed in devices.json so we have no mapping from panda->foopy.  I'm not sure how that happened (I know now that it's a bug), but it led me to believe that using mozpool was the way to disable the slave.  I then reused that logic on other mozpool-managed slaves.

I understand now that *all* pandas are managed via the disabled flag on a foopy, and that this specific instance with panda-0044 was a one-off.  I've added this to our wiki documentation.  Having said that, I've searched the devices.json history in Mercurial and not found any matches for panda-0044.

After looking at all the foopies, I found that panda-0044 is hosted on foopy103.  I've added it to devices.json based on details in Inventory and stopped it via manage_foopies.py.
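
For anyone hitting the same missing-mapping problem, here is a minimal sketch of checking devices.json for a panda->foopy entry before deciding how to disable a slave. It assumes devices.json is a JSON object keyed by device name whose entries carry a "foopy" field; that layout, the helper names, and the "panda-example" entry are assumptions for illustration, not taken from the real file or from manage_foopies.py.

```python
import json

# Hypothetical excerpt of devices.json. The real schema may differ; the
# "foopy" key and the "panda-example" entry are assumptions for this sketch.
DEVICES_JSON = '{"panda-example": {"foopy": "foopy-example"}}'


def find_foopy(devices, device_name):
    """Return the foopy recorded for device_name, or None if unmapped."""
    entry = devices.get(device_name)
    if entry is None:
        return None
    return entry.get("foopy")


def unmapped_devices(devices, expected_names):
    """List the expected device names that have no foopy mapping, sorted."""
    return sorted(
        name for name in expected_names if find_foopy(devices, name) is None
    )


devices = json.loads(DEVICES_JSON)

# Before the fix described above, panda-0044 has no entry at all.
missing_before = unmapped_devices(devices, ["panda-example", "panda-0044"])

# After adding the mapping found by checking the foopies by hand.
devices["panda-0044"] = {"foopy": "foopy103"}
missing_after = unmapped_devices(devices, ["panda-example", "panda-0044"])
```

Running a check like this against the full list of pandas in Inventory would flag any other device that is missing from devices.json, rather than discovering it when a slave needs to be disabled.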
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering
Depends on: 902657
(In reply to Jake Watkins [:dividehex] from comment #2)
> I might be reading this wrong, but it looks like the relay board is being
> contacted directly and it's making a request to mozpool.  Those two don't mix.
> 
> Callek, am I reading that right?

We've been over this before: after we request a device from mozpool, we *do* still do things that contact the relay board directly. We can't fix that until all pandas are handled with mozpool.
Flags: needinfo?(bugspam.Callek)
Sending this slave to recovery
-->Automated message.
recovered by "panda-recovery" bug 902657
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED

Updated

4 years ago
Depends on: 1148116

Updated

5 months ago
Product: Release Engineering → Infrastructure & Operations