tegra/panda health checks (verify.py) should not swallow exceptions

Status: RESOLVED WONTFIX
Product: Release Engineering
Component: General
Opened: 4 years ago
Last modified: 3 months ago

People: (Reporter: pmoore, Unassigned)

Firefox Tracking Flags: (Not tracked)

Attachments: (1 attachment, 2 obsolete attachments)

Sometimes exceptions are "swallowed" in verify.py, e.g.:

https://github.com/mozilla/build-tools/blob/64feaa04f6c63d73c50599109eaa552cc0a6e7a0/sut_tools/verify.py#L122

This means it cannot differentiate between, e.g., not being able to connect to a device at all, and being able to connect but having the "testroot" command fail.

Two changes would be useful:
1) don't swallow the exception - log it instead
2) update error.flg with more specific information about the cause of the failure

I found this one example, but we should check the rest of the code for blanket "except:" clauses and make sure exceptions are logged.

We can also look to see whether there are other types of failure that could be reported in error.flg - I cite only one example above, but we should check whether there are more.
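
For illustration, a rough sketch of the requested pattern (a hypothetical helper, not the existing verify.py code):

# Hypothetical sketch of the requested behaviour: log the exception instead
# of swallowing it, and record a more specific cause in error.flg.
import traceback

def run_check(name, check, error_flag_path):
    try:
        check()
        return True
    except Exception as e:
        # 1) don't swallow: keep the full traceback in the logs
        print "Check %s failed:" % name
        traceback.print_exc()
        # 2) record the specific cause, not a generic message
        with open(error_flag_path, 'w') as f:
            f.write("%s failed: %s\n" % (name, e))
        return False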
Blocks: 783970
Created attachment 8395718 [details] [diff] [review]
Show testroot failures

I'll add some nosetests later - but here is an explanation of the change.

In the examples below, I test two different tegras.

1) tegra-087 is a tegra with no test root
2) tegra-087v is a pretend tegra - it does not exist

The point of the patch is to allow our automation to differentiate and log these two different root causes.

Code today (before applying patch)
==================================
Before the patch, calling sut_lib.connect(...) for either device results in:
"BaseException: Failed to connect to SUT Agent and retrieve the device root."

In other words, if you get this exception, you don't know if:
a) the SUT Agent on the device cannot be reached, or
b) the SUT Agent on the device *can* be reached, but the test root cannot be determined


The change, after including this patch
======================================
After the patch, we have:
tegra-087 returns with:
    "mozdevice.devicemanagerSUT.AgentError: Automation Error: Agent Error processing command 'testroot'; err=' unable to determine test root'"
tegra-087v returns with:
    "mozdevice.devicemanagerSUT.AgentError: Remote Device Error: unable to connect to tegra-087v after 5 attempts"

With this patch, it is now possible to differentiate these two separate root causes.
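
For illustration, a minimal sketch (not part of the attached patch) of how calling code could tell the two causes apart, assuming the AgentError messages shown in the tracebacks below:

# Sketch only: distinguish the two AgentError causes surfaced by the patch.
from mozdevice import devicemanagerSUT as devicemanager

def classify_connect_failure(device_name):
    try:
        return devicemanager.DeviceManagerSUT(device_name), None
    except devicemanager.AgentError as e:
        if 'unable to connect' in str(e):
            return None, 'agent unreachable'        # device/agent is down
        if 'unable to determine test root' in str(e):
            return None, 'testroot unavailable'     # agent up, no test root
        return None, 'agent error: %s' % e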

There is still some work to be done:

* creating nosetests
* updating code higher up the stack to process this (e.g. code that generates error.flg to differentiate the two causes)


And now the demonstration of the change above...

Before:
=======

[cltbld@foopy115.tegra.releng.scl3.mozilla.com ~]$ export PYTHONPATH=/builds/sut_tools:/builds/tools/lib/python
[cltbld@foopy115.tegra.releng.scl3.mozilla.com ~]$ /tools/python27/bin/python2.7
Python 2.7.3 (default, Sep 10 2013, 11:27:00) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from mozdevice import devicemanagerSUT as devicemanager
>>> from sut_lib import connect

# tegra-087 has no test root:

>>> dm = connect('tegra-087', 0)
03/24/2014 09:10:45: INFO: Connecting to: tegra-087
reconnecting socket
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/builds/tools/lib/python/sut_lib/__init__.py", line 76, in connect
    return devicemanager.DeviceManagerSUT(deviceIP)
  File "/builds/sut_tools/mozdevice/devicemanagerSUT.py", line 56, in __init__
    raise BaseException("Failed to connect to SUT Agent and retrieve the device root.")
BaseException: Failed to connect to SUT Agent and retrieve the device root.

# tegra-087v is a pretend tegra:

>>> dm = connect('tegra-087v', 0)
03/24/2014 09:11:01: INFO: Connecting to: tegra-087v
reconnecting socket
Could not connect; sleeping for 5 seconds.
reconnecting socket
Could not connect; sleeping for 10 seconds.
reconnecting socket
Could not connect; sleeping for 15 seconds.
reconnecting socket
Could not connect; sleeping for 20 seconds.
reconnecting socket
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/builds/tools/lib/python/sut_lib/__init__.py", line 76, in connect
    return devicemanager.DeviceManagerSUT(deviceIP)
  File "/builds/sut_tools/mozdevice/devicemanagerSUT.py", line 56, in __init__
    raise BaseException("Failed to connect to SUT Agent and retrieve the device root.")
BaseException: Failed to connect to SUT Agent and retrieve the device root.



After:
======

[cltbld@foopy115.tegra.releng.scl3.mozilla.com ~]$ export PYTHONPATH=/builds/sut_tools:/builds/tools/lib/python
[cltbld@foopy115.tegra.releng.scl3.mozilla.com ~]$ /tools/python27/bin/python2.7
Python 2.7.3 (default, Sep 10 2013, 11:27:00) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from mozdevice import devicemanagerSUT as devicemanager
>>> from sut_lib import connect

# tegra-087 has no test root:

>>> dm = connect('tegra-087', 0)
03/24/2014 09:13:53: INFO: Connecting to: tegra-087
reconnecting socket
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/builds/tools/lib/python/sut_lib/__init__.py", line 76, in connect
    return devicemanager.DeviceManagerSUT(deviceIP)
  File "/builds/sut_tools/mozdevice/devicemanagerSUT.py", line 55, in __init__
    if self.getDeviceRoot() == None:
  File "/builds/sut_tools/mozdevice/devicemanagerSUT.py", line 993, in getDeviceRoot
    data = self._runCmds([{'cmd': 'testroot'}])
  File "/builds/sut_tools/mozdevice/devicemanagerSUT.py", line 163, in _runCmds
    self._sendCmds(cmdlist, outputfile, timeout)
  File "/builds/sut_tools/mozdevice/devicemanagerSUT.py", line 144, in _sendCmds
    raise err
mozdevice.devicemanagerSUT.AgentError: Automation Error: Agent Error processing command 'testroot'; err=' unable to determine test root'

# tegra-087v is a pretend tegra:

>>> dm = connect('tegra-087v', 0)
03/24/2014 09:14:30: INFO: Connecting to: tegra-087v
reconnecting socket
Could not connect; sleeping for 5 seconds.
reconnecting socket
Could not connect; sleeping for 10 seconds.
reconnecting socket
Could not connect; sleeping for 15 seconds.
reconnecting socket
Could not connect; sleeping for 20 seconds.
reconnecting socket
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/builds/tools/lib/python/sut_lib/__init__.py", line 76, in connect
    return devicemanager.DeviceManagerSUT(deviceIP)
  File "/builds/sut_tools/mozdevice/devicemanagerSUT.py", line 55, in __init__
    if self.getDeviceRoot() == None:
  File "/builds/sut_tools/mozdevice/devicemanagerSUT.py", line 993, in getDeviceRoot
    data = self._runCmds([{'cmd': 'testroot'}])
  File "/builds/sut_tools/mozdevice/devicemanagerSUT.py", line 163, in _runCmds
    self._sendCmds(cmdlist, outputfile, timeout)
  File "/builds/sut_tools/mozdevice/devicemanagerSUT.py", line 155, in _sendCmds
    raise AgentError("Remote Device Error: unable to connect to %s after %s attempts" % (self.host, self.retrylimit))
mozdevice.devicemanagerSUT.AgentError: Remote Device Error: unable to connect to tegra-087v after 5 attempts
Attachment #8395718 - Flags: review?(bugspam.Callek)
Attachment #8395718 - Flags: review?(bugspam.Callek) → review+
Attachment #8395718 - Flags: checked-in+
https://hg.mozilla.org/build/tools/rev/5c65da65bc8a
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Comment on attachment 8395718 [details] [diff] [review]
Show testroot failures

Had to back this out
https://hg.mozilla.org/build/tools/rev/774756f24bef
Attachment #8395718 - Flags: checked-in+ → checked-in-
This was reverted; IRC backlog explaining why:

[13:33:52]	armenzg_buildduty	Callek, kmoir I believe your most recent change is causing all the issues
[13:34:07]	Callek	ugh reallyz?
[13:34:13]	Callek	armenzg_buildduty: what issues are we looking at?
[13:34:36]	armenzg_buildduty	Callek, https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Android
[13:34:57]	kmoir	ARRGH
[13:35:34]	Callek	armenzg_buildduty: well.... thats verify.py (in the randomly chosen one I looked at)
[13:35:48]	Callek	armenzg_buildduty: failing to ping after trying to reboot a device via the relay board
[13:35:50]	kmoir	armenzg: what is the problem? I don't see it
[13:36:00]	armenzg_buildduty	all the blue jobs
[13:36:16]	kmoir	is that a larger number than usual?
[13:36:18]	Callek	so I don't think this current instance has anything to do with mozharness
[13:36:49]	kmoir	I saw a huge number of retries when running my tests earlier but those were on the new pandas that had moved
[13:36:51]	Callek	something surely looks funky here though
[13:36:56]	Callek	dividehex: ping?
[13:37:03]	RyanVM|sheriffduty	yeah, we don't normally restart ~4/5 jobs
[13:37:10]	dividehex	Callek: what's up?
[13:37:33]	kmoir	the new pandas that were moved are not enabled in buildbot yet this is my understanding
[13:37:36]	Callek	dividehex: soooo, can you take a random sampling of mozpools and see if anything looks unusual here
[13:37:44]	Callek	dividehex: scl1 specifically
[13:37:59]	Callek	dividehex: just look at recent results in https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=Android incase you're wondering what we're seeing
[13:38:27]	kmoir	if it was a mozharness problem no jobs would be running
[13:38:31]	kmoir	green
[13:38:33]	Callek	dividehex: long story short, verify.py (which is cleaning up fennec installs by uninstalling -- within a mozpool request) are rebooting devices via relay board, and not actually seeing them come back up reliably right now
[13:38:51]	Callek	by reliably I mean far far worse than normal
[13:39:08]	armenzg_buildduty	kmoir, unless the greens we see are from barely before the mozharnes jobs
[13:39:15]	armenzg_buildduty	*mozharness merge
[13:39:23]	Callek	kmoir: argh actually I have a sudden suspicion
[13:39:30]	Callek	kmoir: didn't you update tools on friday?
[13:39:41]	kmoir	Callek: yes
[13:41:37]	dividehex	Callek: how are checking the board has come back after a reboot?
[13:42:11]	Callek	kmoir: I have a sinking suspicion with http://hg.mozilla.org/build/tools/rev/4569 btw
[13:43:28]	kmoir	Callek: okay
[13:43:47]	kmoir	Callek: why would this happen now if I updated the tools on Friday
[13:43:59]	Callek	dividehex: we're doing http://mxr.mozilla.org/build/source/tools/lib/python/sut_lib/__init__.py#501
[13:43:59]	catlee	we run about 30 of them constantly
[13:44:05]	kmoir	Callek: wouldn't it show up earlier?
[13:44:19]	Callek	kmoir: I suspect it just wasn't high enough load prior to today
[13:44:29]	RyanVM|sheriffduty	weren't we seeing a bunch on Friday too?
[13:44:35]	RyanVM|sheriffduty	oh yeah, Callek was Called_disconnected by then
[13:44:43]	RyanVM|sheriffduty	tries to remember who he was pinging
[13:44:45]	RyanVM|sheriffduty	coop?
[13:44:49]	kmoir	Callek: and a higher load would cause issues on the relays?
[13:45:18]	Callek	kmoir: possibly, or longer reboot cycles, either way it looks like we need to back out petes change and update tools
[13:45:43]	RyanVM|sheriffduty	coop, was it you I pinged on Friday about pandas disconnecting?
[13:46:11]	Callek	kmoir: note how petes change got rid of a try/except, at line 515 in current patch we check device root against "None" and at line 483 is the function that calls dm.getDeviceRoot()
[13:46:17]	Callek	kmoir: and we never trap exceptions here
[13:46:17]	taras	catlee: since massimo switched most of them to m3.medium
[13:46:36]	Callek	so it bubbles up to cleanup/verify.py as an unhandled exception, that we report on and fail out with
[13:46:55]	RyanVM|sheriffduty	survey says!
[13:46:56]	RyanVM|sheriffduty	1:47:07 PM - RyanVM|sheriffduty: coop|buildduty: seems pandas are having trouble picking up jobs
[13:46:57]	RyanVM|sheriffduty	1:47:17 PM - RyanVM|sheriffduty: https://tbpl.mozilla.org/php/getParsedLog.php?id=39371627&tree=Mozilla-Aurora
[13:46:57]	RyanVM|sheriffduty	1:48:45 PM - Callek is now known as Callek_disconnected.
[13:46:57]	RyanVM|sheriffduty	1:49:03 PM - coop|buildduty: hrmmm
[13:47:12]	Callek	RyanVM|sheriffduty: sorry...
[13:47:16]	kmoir	Callek: right I was seeing that a lot on my test master too
[13:47:27]	Callek	kmoir: do you got that, or shall I do the revert since my name is attached?
[13:47:45]	kmoir	Callek: you can revert it
[13:47:52]	Callek	ok, will-do
[13:47:57]	kmoir	I was in the middle of testing another patch
[13:47:58]	Callek	RyanVM|sheriffduty: kmoir: sorry for the trouble
[13:48:14]	kmoir	is glad it wasn't my mh change
[13:48:35]	kmoir	and is glad Callek found the source of the retry issues, will make testing better
[13:48:40]	Callek	is glad he thought of that correlation!
[13:49:04]	Callek	kmoir: mobile testing may care about this retry spike for their data point
[13:49:04]	armenzg_buildduty	that was a good catch
[13:49:05]	kmoir	maybe those new pandas in the data centre weren't bad afterall
[13:49:11]	Callek	kmoir: even though no meeting this week
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
All foopies have been updated from tools rev 420092165403 --> tools rev 774756f24bef
Sorry guys, and thanks for picking up the pieces. Backing out was the right way to get this back up and running again quickly.

However, I think the fix is still the right way to go; the calling code higher up the stack needs to be adapted to handle exceptions that previously weren't raised - it is important that we don't swallow them low down and lose visibility of the root cause. I have some other priorities at the moment, so I can't look at this now, but at some point I'd like to understand what caused this failure.
Callek, can you help me to understand the following section?

https://hg.mozilla.org/build/mozharness/file/201775c5cbd1/scripts/android_panda.py#l186

   186     def request_device(self):
   187         self.retrieve_android_device(b2gbase="")
   188         env = self.query_env()
   189         cmd = [self.query_exe('python'), self.config.get("verify_path")]
   190         if self.run_command(cmd, env=env):
   191             self.critical("Preparing to abort run due to failed verify check.")
   192             self.close_request()
   193             self.fatal("Dying due to failing verification")
   194         else:
   195             self.info("Successfully verified the device")

The "if" statement in line 190 seems to be inverted: if self.run_command(cmd, env=env) returns true, I believe the verify was successful, and if self.run_command(cmd, env=env) returns false, I believe the verify was unsuccessful. However, the "true" block (l191-193) contains the failure steps, and the "else" block (l195) contains the success steps. This appears to me to be the wrong way around. What am I missing? Or is this a sinister bug, that has been hidden because the verify has always been failing for some reason, so this makes it look like the verify was always passing?
Flags: needinfo?(bugspam.Callek)
The return value is the exit code of the python .../verify.py process: 1 means "error happened" and 0 means "success", so converted to Python truthiness, True is failure.
Flags: needinfo?(bugspam.Callek)
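
For clarity, a minimal illustration of that truthiness (not the mozharness source; the path is illustrative):

# The value being tested is the child process's exit status,
# so a non-zero (failing) verify.py takes the "if" branch.
import subprocess

status = subprocess.call(['python', 'verify.py'])   # illustrative path
if status:      # non-zero exit code -> verify.py reported an error
    print "verification failed (exit code %d)" % status
else:           # zero exit code -> verification passed
    print "Successfully verified the device"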
Created attachment 8422347 [details] [diff] [review]
Just log error, rather than throwing it

Hey Callek,

Since dm.getDeviceRoot() is called in lots of places, rather than changing its behaviour to raise an exception, this alternative patch just logs the error and proceeds as normal. So this is just additional debugging, which is useful when you don't know whether the problem is that you can't connect to the device, or that you can connect but cannot retrieve the test root.
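
For illustration, a sketch of the idea (the actual change is in the attached diff; assume dm is a DeviceManagerSUT instance):

# Sketch of the approach, not the diff itself: catch the failure, print it
# for debugging, and carry on as before so existing callers are unaffected.
try:
    devroot = dm.getDeviceRoot()
except Exception as e:
    print "Caught Exception:", e    # log only; behaviour otherwise unchanged
    devroot = None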

Pete
Assignee: nobody → pmoore
Attachment #8395718 - Attachment is obsolete: true
Attachment #8422347 - Flags: review?(bugspam.Callek)
Incidentally, I think there may be a race condition.

In the SUT agent code, when you retrieve the test root and it has not been set, the agent stores the error message in its internal testroot variable. So the first time the test root is queried, if it is not available, that internal variable gets set to the error message, and every subsequent query returns this "cached" error message - the agent never tries to determine the test root again.

The possible race condition: when we reboot devices, we keep querying the device for its test root to see whether it is up and ready again. If a query lands before the SD card is mounted (or before the test root is internally set), the internal variable is set to the error message, the device never gets a valid test root, and it will keep failing verify.py checks until it is rebooted again.

This might be a cause of intermittent failures we experience.

This needs to be fixed in the SUT Agent itself: it should not abuse the internal testroot variable by setting it to an error message. If the test root was not ready the first time it was queried, a later query should try again to retrieve it, rather than assuming that because it failed before, it will fail again.
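
For illustration only (the real agent is not written in Python), a sketch of the caching problem and the suggested fix:

# Illustrative sketch of the caching bug described above and the suggested
# fix: only cache the test root once it has genuinely been determined.
class AgentSketch(object):
    def __init__(self):
        self._testroot = None

    def get_test_root_buggy(self):
        # Bug: if the probe fails once, the error message is cached forever.
        if self._testroot is None:
            self._testroot = self._probe_sdcard() or "error: unable to determine test root"
        return self._testroot

    def get_test_root_fixed(self):
        # Fix: cache only a real result; retry on later queries until one exists.
        if self._testroot is None:
            probed = self._probe_sdcard()
            if probed is None:
                raise RuntimeError("unable to determine test root")
            self._testroot = probed
        return self._testroot

    def _probe_sdcard(self):
        # Placeholder: returns the mounted sdcard path, or None if not ready.
        return None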
I've raised bug 1010173 for the above issue (comment 10).
Comment on attachment 8422347 [details] [diff] [review]
Just log error, rather than throwing it

From IRC:

Callek: pmoore: change to |print "Caught Exception:", e| and you have an r+
Attachment #8422347 - Flags: review?(bugspam.Callek) → review+
https://hg.mozilla.org/build/tools/rev/cb82132dc905
Created attachment 8423716 [details] [diff] [review]
tools patch including Callek's review feedback

Updated patch, including Callek's review feedback from IRC.

Checked in and pushed:
https://hg.mozilla.org/build/tools/rev/cb82132dc905
Attachment #8422347 - Attachment is obsolete: true
Attachment #8423716 - Flags: review+
Attachment #8423716 - Flags: checked-in+

Depends on: 1019038
I still need to go through the code and find the other except clauses before I close this bug, which I hope to get a chance to do in July.
Blocks: 1018118
No longer blocks: 783970
Assignee: pmoore → nobody
No longer used
Status: REOPENED → RESOLVED
Last Resolved: a year ago
Resolution: --- → WONTFIX
Component: Tools → General
Product: Release Engineering → Release Engineering