Closed Bug 641475 Opened 13 years ago Closed 13 years ago

Android Tegras are all purple

Categories
(Release Engineering :: General, defect)

Platform: x86_64 Linux
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mfinkle, Assigned: jmaher)

References

Details

Attachments

(3 files, 6 obsolete files)

All the Android tegras used for talos and testing seem to be purple.
Assignee: nobody → aki
I've found some bugs in our SUT helper scripts.
Working on production systems is never ideal, but as our production systems are currently all broken, I'm hoping it's faster than iterating in staging.
This is what I've got running in production, which seems to have fixed it for now.

Notably:

1) I had failed to start the buildslaves with SUT_IP or SUT_NAME defined (and check.py -r didn't start the buildslave at all).  So I added support for figuring these values out, though the Talos remotePerfConfigurator step still needs SUT_IP.  We can probably just use the slave name property and get rid of these env vars.

2) I changed installApp.py to use uninstallAppAndReboot() + installApp(), because:

 a) updateApp() assumed that we would always reboot. I strongly suspect that if the apk is the exact same apk that is currently installed, nothing happens, and then the callback server waits for a ping that never comes and the tegra goes purple.

 b) I don't trust the callback server at all at this point (see reboot.py, below).  So I'm waiting til devRoot is back, then checking to see if Fennec is installed (actually I check this in 3 places, once before uninstalling, once after uninstalling, and once after installing).

3) reboot.py is broken. The only reason it works is that the dm.reboot() timeout is generally longer than the time it takes to reboot, and shorter than the time it takes for buildbot to error out.  It relies on the callback server but never gets a callback.  I manually telnetted to the tegra while reboot.py was still waiting and ran |info uptime|, which reported an uptime of <5min.  I'm going to change this to poll with |info uptime| instead, which will probably be more reliable (see the sketch after this list).

4) At this point I'm not entirely sure what we gain by having clientproxy.py and check.py since I didn't really see buildslaves being affected by these at all.  I'm thinking, going forward,

 a) Add a step at the very beginning of all remote tests that checks tegra connectivity, retrying for X amount of time, then kicks the job into RETRY mode and takes the buildslave offline.

 b) Edit check.py to actually start the buildslave if it's -r and it's not running

 c) Change the test factories and scripts to use the slavename rather than SUT_IP and SUT_NAME.  Also, the scripts should all be usable from the command line in a user-friendly way with the expected behavior.
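
Below is a minimal sketch of the polling I have in mind for 3).  It assumes the agent answers |info uptime| on its usual port (20701, the same one I telnetted to) once the tegra is back up; the helper name, prompt handling, and parsing are illustrative only, not what would land in sut_tools:

  import telnetlib
  import time

  def wait_for_reboot(tegra_ip, port=20701, step=30, timeout=600):
      """Poll the agent with |info uptime| until it answers again (sketch only)."""
      waited = 0
      while waited < timeout:
          try:
              tn = telnetlib.Telnet(tegra_ip, port, 10)
              tn.read_until('$>', 10)       # agent prompt; separator details glossed over
              tn.write('info uptime\n')
              reply = tn.read_until('$>', 10)
              tn.close()
              if reply:
                  # The agent is answering again; the caller can sanity-check the
                  # reported uptime (e.g. <5min means we really did reboot).
                  return reply
          except Exception:
              pass                          # not back yet, keep polling
          time.sleep(step)
          waited += step
      return None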

Any comments/questions/objections?
Attached patch devicemanager patch on foopy01 (obsolete) — Splinter Review
This is the existing devicemanager patch on foopy01.  I don't think it's all necessary or very clean, but it seems to be working atm.
http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mobile looks a lot better.
The immediate workaround is in, though un-landed.

We (ateam+releng+anyone else who wants to) need to also discuss comment 2.

I'm going to start working out of user repos to get this stuff sorted out in a cleaner fashion.
Refactoring into a functions.py, and I keep wishing I had a central config.
Currently trying not to rewrite the sut factories in mozharness, but it may come to that.
Ok, now that things look like they're moving toward resolution, here's a summary of where things are as I understand them.

(In reply to comment #2)
> 1) I had failed to start the buildslaves with SUT_IP or SUT_NAME defined (and
> check.py -r didn't start the buildslave at all).  So I wrote in support for
> figuring this stuff out, though the Talos remotePerfConfigurator step still
> needs SUT_IP.  We can probably just use the slave name property and get rid of
> these env vars.

As Bear pointed out in IRC (though I argued rather loudly), I failed to start clientproxy properly... I did a

  cd builds
  for i in tegra-*; do
    echo $i
    cd $i
    python clientproxy.py -b --tegra=`pwd`
    cd ..
  done

when that should be --tegra=$i

However, we really need a way to start these guys up... I rebooted foopy01 to clear any old crufty processes and had to write this for loop while trying to debug breakage.

I've fixed up foopy01 (reboot, restart clientproxy correctly, verify that the buildslaves start automatically) since there was a lull in test runs.

>  b) I don't trust the callback server at all at this point (see reboot.py,
> below).

The callback server had a bug -- the global variable needed to be reset every run, but was only set at first run.  Clint has a patch for devicemanager coming, which should fix installApp.py and reboot.py without changes.

Some of my changes may still be wanted for some future version of the scripts (e.g. pulling functions out so they can be shared across scripts, guessing the tegra name so it's easier to call the scripts from the command line), but if we can get things working with a small delta, I'm perfectly happy to leave things the way they are for now.
(In reply to comment #7)
> However, we really need a way to start these guys up...

This may just mean having a start_cps.sh that does the above [corrected] for loop.
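
Something like the sketch below, which is just the corrected for loop wrapped in a script.  The /builds path and the assumption that every tegra-* directory on the foopy has its own clientproxy.py are guesses about the layout, not gospel:

  #!/bin/bash
  # start_cps.sh - (re)start clientproxy for every tegra directory on this foopy
  cd /builds || exit 1
  for i in tegra-*; do
    echo "$i"
    cd "$i"
    python clientproxy.py -b --tegra=$i
    cd ..
  done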

Currently foopy01 has the patched sut_tools, to avoid the devicemanager callback server.  Once we get the devicemanager patch, we can revert the sut_tools patch and apply the devicemanager patch and things should still work.
This patch contains all of aki's changes from attachment 519277 [details] [diff] [review].

It also includes the fix for the callbackserver: we didn't reset gCallbackData, so if we re-instantiated the callbackServer object we'd blow right through our wait loop, causing the device and devicemanager.py to get out of sync.

Also, while debugging this I found that the processExist code had not been updated to account for the env string being the first member of the string passed to processExist.  So I changed it to look for the application name in the second element instead (processExist is passed exactly the command line we send to the agent, so it has the form "<envstring> <app> <redirect>").

Running a burn in test now, using the same apk file and the same port number.  Will attach my test.
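
To illustrate the command-line shape described above (the values are made up; only the "<envstring> <app> <redirect>" layout matters):

  # Illustration only: roughly what processExist receives as its appname argument.
  cmdline = '"FOO=1,BAR=2" /data/data/org.mozilla.fennec/fennec args'
  pieces = cmdline.split(' ')
  # pieces[0] is the quoted env string, so the app path is pieces[1]:
  app = pieces[1].split('/')[-1]    # -> 'fennec'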
Attachment #519277 - Attachment is obsolete: true
Attachment #519566 - Flags: review?(jmaher)
Attached file The test
This is the test.  Needs a little cleanup to run in our test harness.
Comment on attachment 519566 [details] [diff] [review]
Patch with Aki's changes and a fix for the callback server


>     if len(parts[0].strip('"').split('=')) > 1:
>       appname = ' '.join(parts[1:])
>   
>     pieces = appname.split(' ')
>-    parts = pieces[0].split('/')
>+    # We need pieces[1] because pieces[0] is now the env string
>+    parts = pieces[1].split('/')
>     app = parts[-1]
>     procre = re.compile('.*' + app + '.*')

This pieces[0] has worked for thousands of tests; I think we are fixing the wrong thing.  I bet in the scenario where we have a "" for env, we end up with an extra ' ' at the front of the appname, which causes us to screw up on split(' ').  I propose something more like:
pieces = appname.strip().split(' ') <- assuming strip() removes the leading whitespace.
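
A quick illustration of that leading-space failure mode (made-up string; only the split behaviour matters):

  >>> ' org.mozilla.fennec args'.split(' ')
  ['', 'org.mozilla.fennec', 'args']
  >>> ' org.mozilla.fennec args'.strip().split(' ')
  ['org.mozilla.fennec', 'args']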


>+      else:
>+        print '.',
>       time.sleep(step)
>       t += step

All print statements should have an if self.debug clause.  This one should probably be a self.debug >= 2 condition.

>     try:
>       if (self.debug >= 3): print "Shutting down server now"
>       self.server.shutdown()
>     except:
>       print "Unable to shutdown callback server - check for a connection on port: " + str(self.port)
>     return gCallbackData

same here about the print and self.debug.

> 
>   class myhandler(SocketServer.BaseRequestHandler):
>     def handle(self):
>       global gCallbackData
>       gCallbackData = self.request.recv(1024)
>-      #print "Callback Handler got data: " + str(gCallbackData)
>+      print "Callback Handler got data: " + str(gCallbackData)
>       self.request.send("OK")

and here again.


r- only for the processExist stuff.  I will work on an updated patch here shortly.
Attachment #519566 - Flags: review?(jmaher) → review-
Yeah, the prints are leftover from my debugging.  We don't actually need them, but if we had the |print '.',| inside an |if self.debug > 3| that would help us make sure long waits don't hit buildbot timeouts.
(In reply to comment #7)
> Ok, now that things are looking like they're moving to resolution, a summary of
> where things are as I understand them.
> 
> (In reply to comment #2)
> > 1) I had failed to start the buildslaves with SUT_IP or SUT_NAME defined (and
> > check.py -r didn't start the buildslave at all).  So I wrote in support for
> > figuring this stuff out, though the Talos remotePerfConfigurator step still
> > needs SUT_IP.  We can probably just use the slave name property and get rid of
> > these env vars.
> 
> As Bear pointed out in IRC (though I argued rather loudly) I failed to do start
> clientproxy properly... I did a

no more loudly than I was, either

the startup was on my todo list because it's something that I don't expect a buildduty person to have to remember - I have a prototype ready that also does some other maintenance items, but I wasn't ready to test it yet.

> 
>   cd builds
>   for i in tegra-*; do
>     echo $i
>     cd $i
>     python clientproxy.py -b --tegra=`pwd`
>     cd ..
>   done
> 
> when that should be --tegra=$i
> 
> However, we really need a way to start these guys up... I rebooted foopy01 to
> clear any old crufty processes and had to write this for loop while trying to
> debug breakage.
> 
> I've fixed up foopy01 (reboot, restart clientproxy correctly, verify that the
> buildslaves start automatically) since there was a lull in test runs.
> 
> >  b) I don't trust the callback server at all at this point (see reboot.py,
> > below).
> 
> The callback server had a bug -- the global variable needed to be reset every
> run, but was only set at first run.  Clint has a patch for devicemanager
> coming, which should fix installApp.py and reboot.py without changes.
> 
> Some of my changes may still be wanted for some future version of the scripts
> (e.g. pulling functions out so they can be shared across scripts, guessing the
> tegra name so it's easier to call the scripts from manual command line) but if
> we can get to working with a small delta, I'm perfectly happy to leave things
> the way they are for now.

looking at your changes on the plane, I love the way they were heading - I was being very conservative in changing how sut_tools worked (for silly bear reasons, in retrospect).

I would love to continue with any change you started if the end goal is making things easier and simpler for buildduty.
Updated the patch to add some smarts to processExist.  Tested with mochitest and reftest, as well as some hand testing of a few bogus appname variables.
Attachment #519566 - Attachment is obsolete: true
Attachment #519587 - Flags: review?(ctalbert)
The unit tests on test-master01 are running but purple due to reboot.py disconnects.  Either clientproxy is killing buildbot or something else is up.
A fix may be changing the reboot step in the factory to use DisconnectStep.
Attachment #519587 - Attachment is obsolete: true
Attachment #519587 - Flags: review?(ctalbert)
Attachment #519707 - Flags: review?(ctalbert)
Attachment #519707 - Flags: feedback?(aki)
Comment on attachment 519707 [details] [diff] [review]
updated patch for r- comments and reverting reboot() (2.0)

I'd love a

if (self.debug >= 3):
  print ".",

somewhere in the while (count < timeout): loop, so we can extend past the buildbot timeout.

I'm wondering why you're doing a global gCallbackData and not self.gCallbackData ?  Do we access gCallbackData by other means than either callbackServer or the return value from callbackServer.disconnect() ?

But this looks like it'll work for us, even without the above.
Attachment #519707 - Flags: feedback?(aki) → feedback+
This is a second option to reboot(wait=True).  It puts a file in the sutagent folder which, upon reboot, forces a callback.  No agent change needed.  Tested in a loop locally as well as one at a time.  I prefer this because it uses the callback for all reboots we do, with no polling.
Attachment #519732 - Flags: review?(ctalbert)
added the ',' to the end of print '.' :)
Attachment #519732 - Attachment is obsolete: true
Attachment #519732 - Flags: review?(ctalbert)
Attachment #519736 - Flags: review?(ctalbert)
Attachment #519736 - Flags: feedback?(aki)
Comment on attachment 519736 [details] [diff] [review]
updated with hack to reboot to force callback server (1.1)

>+      else:
>+        if (self.debug >= 3): print '.'

No comma! D:
Comment on attachment 519732 [details] [diff] [review]
updated with hack to reboot to force callback server (1.0)

>diff --git a/build/mobile/devicemanager.py b/build/mobile/devicemanager.py
>--- a/build/mobile/devicemanager.py
>+++ b/build/mobile/devicemanager.py
>@@ -567,21 +567,24 @@ class DeviceManager:
>   # iterates process list and returns pid if exists, otherwise None
>   # external function
>   # returns:
>   #  success: pid
>   #  failure: None
>   def processExist(self, appname):
>     pid = None
> 
>-    #remove the environment variables in the cli if they exist
>+    #filter out extra spaces
>     parts = filter(lambda x: x != '', appname.split(' '))
>+    appname = ' '.join(parts[:])
> 
>-    if len(parts[0].strip('"').split('=')) > 1:
>-      appname = ' '.join(parts[1:])
>+    #filter out the quoted env string if it exists
>+    parts = appname.split('"')
>+    if (len(parts) > 2):
>+      appname = ' '.join(parts[2:]).strip()
This looks a lot better than my change. Nice.
>@@ -1284,42 +1291,47 @@ class DeviceManager:
> 
> gCallbackData = ''
> 
> class myServer(SocketServer.TCPServer):
>   allow_reuse_address = True
> 
> class callbackServer():
>   def __init__(self, ip, port, debuglevel):
>+    global gCallbackData
>     self.ip = ip
>     self.port = port
>     self.connected = False
>     self.debug = debuglevel
>     if (self.debug >= 3) : print "Creating server with " + str(ip) + ":" + str(port)
>     self.server = myServer((ip, port), self.myhandler)
>     self.server_thread = Thread(target=self.server.serve_forever) 
>     self.server_thread.setDaemon(True)
>     self.server_thread.start()
>+    gCallbackData = ''
Put this ^ right after the 'global gCallbackData' up at the top of this function.  Putting it after thread.start() gives us a very slight race condition.
> 
>   def disconnect(self, step = 60, timeout = 600):
>+    global gCallbackData
You don't need this ^ statement because you're not writing to the gCallbackData variable in this function, you can remove it.
>     t = 0
>     if (self.debug >= 3): print "Calling disconnect on callback server"
>     while t < timeout:
>       if (gCallbackData):
>         # Got the data back
>         if (self.debug >= 3): print "Got data back from agent: " + str(gCallbackData)
>         break
>+      else:
>+        if (self.debug >= 3): print '.'
>       time.sleep(step)
>       t += step
> 
>     try:
>       if (self.debug >= 3): print "Shutting down server now"
>       self.server.shutdown()
>     except:
>-      print "Unable to shutdown callback server - check for a connection on port: " + str(self.port)
>+      if (self.debug >= 1): print "Unable to shutdown callback server - check for a connection on port: " + str(self.port)
>     return gCallbackData
> 
>   class myhandler(SocketServer.BaseRequestHandler):
>     def handle(self):
>       global gCallbackData
>       gCallbackData = self.request.recv(1024)
>       #print "Callback Handler got data: " + str(gCallbackData)
>       self.request.send("OK")
Attachment #519732 - Flags: review+
Comment on attachment 519736 [details] [diff] [review]
updated with hack to reboot to force callback server (1.1)

>diff --git a/build/mobile/devicemanager.py b/build/mobile/devicemanager.py
>--- a/build/mobile/devicemanager.py
>+++ b/build/mobile/devicemanager.py
>@@ -567,21 +567,24 @@ class DeviceManager:
>   # iterates process list and returns pid if exists, otherwise None
>   # external function
>   # returns:
>   #  success: pid
>   #  failure: None
>   def processExist(self, appname):
>     pid = None
> 
>-    #remove the environment variables in the cli if they exist
>+    #filter out extra spaces
>     parts = filter(lambda x: x != '', appname.split(' '))
>+    appname = ' '.join(parts[:])
> 
>-    if len(parts[0].strip('"').split('=')) > 1:
>-      appname = ' '.join(parts[1:])
>+    #filter out the quoted env string if it exists
>+    parts = appname.split('"')
>+    if (len(parts) > 2):
>+      appname = ' '.join(parts[2:]).strip()
This looks much better than my change.

>@@ -1284,42 +1291,47 @@ class DeviceManager:
> 
> gCallbackData = ''
> 
> class myServer(SocketServer.TCPServer):
>   allow_reuse_address = True
> 
> class callbackServer():
>   def __init__(self, ip, port, debuglevel):
>+    global gCallbackData
>     self.ip = ip
>     self.port = port
>     self.connected = False
>     self.debug = debuglevel
>     if (self.debug >= 3) : print "Creating server with " + str(ip) + ":" + str(port)
>     self.server = myServer((ip, port), self.myhandler)
>     self.server_thread = Thread(target=self.server.serve_forever) 
>     self.server_thread.setDaemon(True)
>     self.server_thread.start()
>+    gCallbackData = ''
Move this ^ up to just after the 'global gCallbackData' line at the top of this function.  Putting it after thread.start() causes a small race condition.

> 
>   def disconnect(self, step = 60, timeout = 600):
>+    global gCallbackData
Don't need this ^ cause you're not writing to gCallbackData in this function anymore.

r+ with those changes.  Looks really good and is testing well on my tegra.
Attachment #519736 - Flags: review?(ctalbert) → review+
Comment on attachment 519707 [details] [diff] [review]
updated patch for r- comments and reverting reboot() (2.0)

I think I'd rather not use the polling, because that hasn't been sufficient in the past, so let's go forward with the callbackserver approach for reboot, as done in the newer patch.
Attachment #519707 - Flags: review?(ctalbert) → review-
fixed nits, carryover r+ from previous patch.
Assignee: aki → jmaher
Attachment #519707 - Attachment is obsolete: true
Attachment #519736 - Attachment is obsolete: true
Attachment #519736 - Flags: feedback?(aki)
Attachment #519750 - Flags: review+
Ok. We're getting to the next error, which is a good sign.
Looks like we're installing, getting a callback, returning the correct status, and then installApp.py tries to push app.ini and dies:

Shutting down server now
INFO: updateApp: got status back: update started org.mozilla.fennec /data/data/com.mozilla.SUTAgentAndroid/files/tests/fennec-4.0b6pre.multi.eabi-arm.apk
uninst complete [1]
install complete [1]
Success

updateApp() call returned update started org.mozilla.fennec /data/data/com.mozilla.SUTAgentAndroid/files/tests/fennec-4.0b6pre.multi.eabi-arm.apk
uninst complete [1]
install complete [1]
Success

in push file with: ../talos-data/fennec/application.ini, and: /data/data/org.mozilla.fennec/application.ini
reconnecting socket
remote hash returned: 'd41d8cd98f00b204e9800998ecf8427e'
local hash returned: '014f44856c68554c84c46d1b7e512c12'
failed making directory: /data/data/org.mozilla.fennec
unable to make dirs: /data/data/org.mozilla.fennec/application.ini
/builds/tegra-078/talos-data/../error.flg
Remote Device Error: unable to push ../talos-data/fennec/application.ini


I don't know why this is failing since we haven't touched the pushFile portion.
Some guesses:

1) The callback happens when the device is online, but it's not ready to take any commands yet. I've inserted a time.sleep(60) at the end of callbacksvr::disconnect() right before the return, but this isn't helping.  Either that's not it or it needs a longer sleep.

 2) The processExist function wouldn't be trying to kill something with an argument that contains org.mozilla.fennec, correct?  (Pretty sure this isn't it.)

3) Some weird permission issue
foopy04:builds cltbld$ ls tegra-081/talos-data/fennec
META-INF		defaults		platform.ini
chrome			greprefs.js		plugin-container
chrome.manifest		lib			res
classes.dex		lib.id			resources.arsc
components		modules			update.locale

Or application.ini is missing =P
Oh, this may be us cleaning up the directory.
Stopping work on this to build rc1.
Not sure where things are atm, but it seemed like updateApp wasn't installing fennec at all, hence the pushFile error.

Quick thoughts:

1) To investigate: OSX firewall settings on foopy0{1..4}

2) Question: Do we believe that the callback server is a superior solution to polling?  If so, why?
(In reply to comment #29)
> Not sure where things are atm, but it seemed like updateApp wasn't installing
> fennec at all, hence the pushFile error.

From the outside it does appear that something is not right with updateApp.

We tried install, reboot, update, reboot, and also uninstall, reboot, update, and some other combinations.  What is odd is that jmaher is able to use the sut_tools installApp scripts with the current patched devicemanager and have it work, so we need to get tegra-002 over to bmoss (or ateam) so that they can confirm it is set up properly.

> 
> Quick thoughts:
> 
> 1) To investigate: OSX firewall settings on foopy0{1..4}
> 
> 2) Question: Do we believe that the callback server is a superior solution to
> polling?  If so, why?

callback is preferred, IMO, because it's discrete and immediate - with polling you need loops and delays and flags/variables to remember what the state is/was in other parts of the loop.

not to say that if polling works right now we shouldn't use it, with an eye towards moving back to callbacks later
(In reply to comment #30)
> callback is preferred, IMO, because it's discrete and immediate - with polling
> you loops and delays and flags/variables to remember what the state is/was in
> other parts of the loop.
> 
> not to say if polling works right now we shouldn't use it with an eye towards
> moving back to callbacks later

Callback has many moving parts, and requires both sides to be set up correctly at the same time.  If one side isn't set up correctly, or the message comes when the listener isn't listening, it's gone forever and we have no logs to help us.

If we poll, one side is active and the other listens on the 20701 port it always listens on. We can log to our hearts' content, and if we miss a poll we can always wait and try again, or log the reason why the poll failed.

What do we gain by having a temporal callback that has no logging attached except "waiting for message that isn't coming" ?
(In reply to comment #31)
> (In reply to comment #30)
> > callback is preferred, IMO, because it's discrete and immediate - with polling
> > you loops and delays and flags/variables to remember what the state is/was in
> > other parts of the loop.
> > 
> > not to say if polling works right now we shouldn't use it with an eye towards
> > moving back to callbacks later
> 
> Callback has many moving parts, and requires both sides to be set up correctly
> at the same time.  If one side isn't set up correctly, or the message comes
> when the listener isn't listening, it's gone forever and we have no logs to
> help us.
> 
> If we poll, one side is active and the other listens on the 20701 port it
> always listens on. We can log to our hearts' content, and if we miss a poll we
> can always wait and try again, or log the reason why the poll failed.
> 
> What do we gain by having a temporal callback that has no logging attached
> except "waiting for message that isn't coming" ?

The only thing that dialback callback gave me that polling couldn't was notification when a reboot happened that was not planned - that shortened the time the code was sitting in an event loop waiting for something that never would have happened.  This is what is making me want to have callback available even if it isn't the primary method.
(In reply to comment #32)
> The only thing that dialback callback gave me that polling couldn't was
> notification when a reboot happened that was not planned - that shortened the
> time the code was sitting in an event loop waiting for something that never
> would have happened.  This is what is making me want to have callback available
> even if it isn't the primary method.

If polling is smart, it can do |info uptime| as well as checking devRoot.
I'm aware it's slower by design, but it's also less fragile by design.
I will take that tradeoff in a heartbeat.
No longer all purple.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering