Closed Bug 835588 (opened 11 years ago, closed 8 years ago)

Intermittent "BaseException: Failed to connect to SUT Agent and retrieve the device root."

Categories

(Release Engineering :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2649] [retry])

Amusingly enough, we've put this in bug 778688 multiple times, but we don't seem to have actually filed it.

https://tbpl.mozilla.org/php/getParsedLog.php?id=19210080&tree=Mozilla-Inbound
Android Tegra 250 mozilla-inbound opt test mochitest-5 on 2013-01-28 13:39:56 PST for push 0c45e6378f1f
slave: tegra-182

========= Started Install App on Device failed (results: 2, elapsed: 1 mins, 57 secs) (at 2013-01-28 13:42:57.709946) =========
python /builds/sut_tools/installApp.py 10.250.50.92 build/fennec-21.0a1.en-US.android-arm.apk org.mozilla.fennec
 in dir /builds/tegra-182/test/. (timeout 1200 secs)
 watching logfiles {}
 argv: ['python', '/builds/sut_tools/installApp.py', '10.250.50.92', u'build/fennec-21.0a1.en-US.android-arm.apk', 'org.mozilla.fennec']
 environment:
  HOME=/Users/cltbld
  PATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/builds/tegra-182/test
  SUT_IP=10.250.50.92
  SUT_NAME=tegra-182
  __CF_USER_TEXT_ENCODING=0x1F5:0:0
 closing stdin
 using PTY: False
01/28/2013 13:42:57: INFO: copying build/fennec/application.ini to build/talos/remoteapp.ini
01/28/2013 13:42:57: DEBUG: calling [cp build/fennec/application.ini build/talos/remoteapp.ini]
reconnecting socket
01/28/2013 13:42:57: DEBUG: cp: build/talos/remoteapp.ini: No such file or directory
01/28/2013 13:42:57: INFO: connecting to: 10.250.50.92
Could not connect; sleeping for 5 seconds.
reconnecting socket
Could not connect; sleeping for 10 seconds.
reconnecting socket
Could not connect; sleeping for 15 seconds.
reconnecting socket
Could not connect; sleeping for 20 seconds.
reconnecting socket
Traceback (most recent call last):
  File "/builds/sut_tools/installApp.py", line 187, in <module>
    sys.exit(main(sys.argv))
  File "/builds/sut_tools/installApp.py", line 166, in main
    dm, devRoot = one_time_setup(ip_addr, path_to_main_apk)
  File "/builds/sut_tools/installApp.py", line 115, in one_time_setup
    dm = devicemanager.DeviceManagerSUT(ip_addr)
  File "/builds/tools/sut_tools/mozdevice/devicemanagerSUT.py", line 53, in __init__
    raise BaseException("Failed to connect to SUT Agent and retrieve the device root.")
BaseException: Failed to connect to SUT Agent and retrieve the device root.
program finished with exit code 1
elapsedTime=117.386960
========= Finished Install App on Device failed (results: 2, elapsed: 1 mins, 57 secs) (at 2013-01-28 13:44:55.121868) =========
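The log shows the agent connection being retried with sleeps of 5, 10, 15 and 20 seconds before the exception is raised, i.e. a linear backoff over five attempts. A minimal Python 3 sketch of that pattern follows, assuming the attempt count and 5-second step inferred from the log; `connect_with_linear_backoff` is a hypothetical helper, not the actual mozdevice code.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_linear_backoff(
    connect: Callable[[], T],
    attempts: int = 5,
    step: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` up to `attempts` times with linearly growing sleeps.

    Hypothetical helper: with the defaults it sleeps 5, 10, 15 and 20 seconds
    between five attempts, mirroring the log above. Not the mozdevice code.
    """
    last_error: Optional[OSError] = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError as err:  # socket errors are OSError subclasses on Python 3
            last_error = err
            if attempt < attempts:
                delay = step * attempt
                print("Could not connect; sleeping for %d seconds." % delay)
                sleep(delay)
    raise ConnectionError(
        "Failed to connect after %d attempts" % attempts
    ) from last_error


if __name__ == "__main__":
    recorded = []

    def unreachable():
        raise OSError("connection refused")

    try:
        # Record the delays instead of actually sleeping.
        connect_with_linear_backoff(unreachable, sleep=recorded.append)
    except ConnectionError as err:
        print(err, recorded)  # recorded == [5.0, 10.0, 15.0, 20.0]
```

Injecting `sleep` keeps the backoff testable without waiting the roughly 50 seconds of real sleeps seen in the log.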

Seems like the sort of thing that we ought to be catching and setting RETRY on.
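One way to act on this, sketched under assumptions: treat an unreachable agent as an infrastructure error and exit with a dedicated code that the harness maps to RETRY, instead of the generic failure (`results: 2`) seen above. `DeviceConnectionError`, `run_install` and `RETRY_EXIT_CODE` are hypothetical. The real `DeviceManagerSUT` raises a bare `BaseException` (see the traceback), and catching that as-is would also swallow `KeyboardInterrupt` and `SystemExit`, which is why the sketch assumes a dedicated exception type.

```python
import sys
from typing import Callable, TypeVar

Device = TypeVar("Device")

# Hypothetical exit code; the harness would have to be configured to map it
# to a RETRY result. This is not an existing sut_tools convention.
RETRY_EXIT_CODE = 5


class DeviceConnectionError(Exception):
    """Hypothetical dedicated error for "SUT agent unreachable"."""


def run_install(
    setup: Callable[[], Device],
    install: Callable[[Device], int],
) -> int:
    """Run setup, then install; report an unreachable agent as retryable."""
    try:
        dm = setup()
    except DeviceConnectionError as err:
        # Infrastructure problem rather than a product failure: log it and
        # return the retry code instead of a generic failure.
        print("Device unreachable, requesting retry: %s" % err, file=sys.stderr)
        return RETRY_EXIT_CODE
    return install(dm)


if __name__ == "__main__":
    def unreachable():
        raise DeviceConnectionError(
            "Failed to connect to SUT Agent and retrieve the device root."
        )

    # In a real script this value would be passed to sys.exit().
    print(run_install(unreachable, lambda dm: 0))  # prints 5
```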
How frequent is it?
Whiteboard: [retry]
It happened on 2012-11-19, 2013-01-14 and 2013-01-28. Because it doesn't use a message that tbpl parses, it's annoying to even look at, so it's hard to say how often it happens and gets ignored. I'd guess on the order of twice a week.
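Since tbpl doesn't recognize this failure, one rough way to gauge its frequency is to scan saved logs for the exact message from the traceback. A minimal sketch, assuming the logs have already been downloaded as text (fetching them is not shown); `count_sut_connect_failures` is a hypothetical helper.

```python
from typing import Iterable

# Exact message from the traceback above.
SUT_CONNECT_FAILURE = "Failed to connect to SUT Agent and retrieve the device root."


def count_sut_connect_failures(logs: Iterable[str]) -> int:
    """Count how many log texts contain the SUT connection failure (once per log)."""
    return sum(1 for text in logs if SUT_CONNECT_FAILURE in text)


if __name__ == "__main__":
    sample_logs = [
        "BaseException: " + SUT_CONNECT_FAILURE + "\nprogram finished with exit code 1",
        "mochitest-5 passed",
    ]
    print(count_sut_connect_failures(sample_logs))  # prints 1
```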
https://tbpl.mozilla.org/php/getParsedLog.php?id=19830442&tree=Mozilla-Inbound

And there are more of these in bug 820851: a dead panda has pretty fair odds of failing to reboot, and unlike this failure, that one gets picked up by tbpl.
Justin, please can you take a look at this, or find a more appropriate owner - thanks :-)
Assignee: nobody → bugspam.Callek
Flags: needinfo?(bugspam.Callek)
Coop, this is a good one for whoever does the sut* refactoring we discussed. It's going to be harder before the refactoring, but it shouldn't be too hard, and it would be an easy win for the sheriffs. Feel free to reassign, as I'm unsure when I would get to it myself.
Flags: needinfo?(bugspam.Callek) → needinfo?(coop)
Assignee: bugspam.Callek → pmoore
Blocks: 850572
Flags: needinfo?(coop)
Depends on: 816971
Hi guys,

I'll look at this as soon as possible - I just have to finish off another ticket first. I will start discussions with Callek this week, and hope to have a concentrated stab at these mobile issues next week.

Thanks,
Pete
Status: NEW → ASSIGNED
No longer blocks: 778688
(In reply to Pete Moore [:pete][:pmoore] from comment #64)
> Hi guys,
> 
> I'll look at this as soon as possible - I just have to finish off another
> ticket first. I will start discussions with Callek this week, and hope to
> have a concentrated stab at these mobile issues next week.
> 
> Thanks,
> Pete

Any news on this? :-)
Flags: needinfo?(pmoore)
Hi guys,

Sorry about the delay on this - I will look into this now.

Pete
Flags: needinfo?(pmoore)
(In reply to Pete Moore [:pete][:pmoore] from comment #246)
> Hi guys,
> 
> Sorry about the delay on this - I will look into this now.
> 
> Pete

Any luck with this? :-)
(In reply to Ed Morley [:edmorley UTC+1] from comment #328)
> (In reply to Pete Moore [:pete][:pmoore] from comment #246)
> > Hi guys,
> > 
> > Sorry about the delay on this - I will look into this now.
> > 
> > Pete
> 
> Any luck with this? :-)
Flags: needinfo?(pmoore)
Hi Ed,

Again apologies, I've not got closer to identifying the cause of this yet. I hope to give this more attention when I have the watcher updated on all staging pandas and tegras.

Thanks,
Pete
Flags: needinfo?(pmoore)
(In reply to Pete Moore [:pete][:pmoore] from comment #351)
> Hi Ed,
> 
> Again apologies, I've not got closer to identifying the cause of this yet. I
> hope to give this more attention when I have the watcher updated on all
> staging pandas and tegras.
> 
> Thanks,
> Pete

No worries :-)
Product: mozilla.org → Release Engineering
Hi Ed,

The work on upgrading the watcher has finished (at least in Staging, it still needs to be rolled out to production) but I am now available to look into this in more detail. I'll keep the bug posted with progress.

Thanks,
Pete
No time to work on this at the moment, sorry guys. Assigning back to nobody...
Assignee: pmoore → nobody
Whiteboard: [retry] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2642] [retry]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2642] [retry] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2649] [retry]
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Component: General Automation → General