verify script called by mozharness for android devices doesn't reboot via mozpool

RESOLVED INVALID

Status

Release Engineering
Mozharness
RESOLVED INVALID
4 years ago
2 years ago

People

(Reporter: kmoir, Assigned: kmoir)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Assignee)

Description

4 years ago
The mozharness scripts for android makes a call to the sut_tools/verify.py script.  This makes a call to sut reboot.  This could lead to problems where the reboot by the mozpool could occur after the sut reboot because the devices are in the ready state and managed by mozpool. The end result is that we could have unexpected reboots of devices.
(Assignee)

Comment 1

4 years ago
second sentence should have read

"This could lead to problems where a device rebooted by sut could be identified as a device down by mozpool and spuriously rebooted"
(Assignee)

Updated

4 years ago
Blocks: 829211

Comment 2

4 years ago
We probably should look into making verify to call mozpool's reboot calls only for pandas.

Any thoughts?
(In reply to Armen Zambrano G. [:armenzg] (Release Enginerring) from comment #2)
> We probably should look into making verify to call mozpool's reboot calls
> only for pandas.
> 
> Any thoughts?

This is exactly the point of this bug. We want to use the api for reboots for any device managed by mozpool.

Its not the case yet, but we can certainly make it be so. One blocker for the always nature is the fact that a locked_out device won't reboot if we ask mozpool to do it. (which means talos jobs won't reboot at present)

I also think this change should be seperately rolled out from a switch-talos-to-mozharness deploy
(Assignee)

Updated

4 years ago
Assignee: nobody → kmoir
Product: mozilla.org → Release Engineering
(Assignee)

Comment 4

4 years ago
There are two scripts that are invoked by the mozharness scripts verify.py and installApp.py.

verify.py from command line
calls verifyDevice
calls canPing -> reboot if cannot can ping
checksut -> calls updateSUTVersion -> calls updateSUT -> calls -> def doUpdate in updateSUT.py -> 
checkAndFixScreen -> calls soft_reboot (not called now)
cleanupDevice -> calls cleanupDevice in sutTools.py which sets  reboot_needed = True for pandas and calls soft_reboot_and_verify in powermanagement.py

installApp.py via command line main
-> calls installOneApp --> calls installApp --> calls _runCmds
 then finds robocop.apk and tries to install it too via installOneApp
 --then calls waitForDevice 
waitForDevice waits for the device to become available again after the install although I can't find a call to reboot the device

The other non-mozpool reboot is the count_and_reboot.py step outside of the mozharness script.  This should not be needed because the mozharness script calls the close_request action which returns the device to the mozpool.  If you set the reboot_command to a blank string in mobile_config.py this will stop these reboots.
PLATFORMS['android']['mozharness_config'] = {
    'mozharness_python': '/tools/buildbot/bin/python',
    'hg_bin': 'hg', 
    'reboot_command': "",
    'talos_script_maxtime': 10800,
Hey Kim,

Good thinking to highlight this - this could really help stability.

Also to note, there is a cronjob running on the foopies (/etc/cron.d/foopy) that runs /builds/watch_devices.sh every 5 minutes.

watch_devices.sh monitors each of the pandas on the foopies (see /builds/watcher.log and /builds/<panda>/watcher.log) which is also calling the following python code (in tools repo, not in mozharness):

/builds/sut_tools/verify.py
/builds/sut_tools/cleanup.py
/builds/sut_tools/stop.py

I think it would make sense at the same time to review what overlap exists between watch_devices.sh and their mozpool management, since this may also be a place where a conflict of interest could occur.

This code was written at a time when mozpool was not managing all pandas, and is also used by tegras, so a review might highlight functionality that can be disabled/removed, which is not appropriate for mozpool-managed devices.

I believe kitten herder has been disabled for pandas and tegras.
Not sure about other tools that might be watching pandas too (slave api, nagios, ...).
Are there any other buildbot steps that could be running, which might be rebooting pandas, or is that all migrated to mozharness now?

Maybe we should document all of this as a first step to validating the integrity of the current architecture.
(the reason i suggest we document it is simply that i'm not sure that anybody has a *full* overview of *all* systems that are potentially validating and rebooting devices - maybe if we capture this, we might realise there is redundancy in there, and maybe this might be a cause of twistd failures, disconnects, unexpected reboots etc, that we see)

Comment 7

4 years ago
The two components that you both mention should be the full picture.
I doubt that we're doing calls to slaveapi to reboot pandas. In any case, it would a releng running it manually.
(Assignee)

Comment 8

4 years ago
Created attachment 821709 [details] [diff] [review]
removes the final reboot via count_and_reboot.py outside mh script

since the mozharness script returns the device to the pool with the close_request action

tested in staging
Attachment #821709 - Flags: review?(armenzg)

Comment 9

4 years ago
Comment on attachment 821709 [details] [diff] [review]
removes the final reboot via count_and_reboot.py outside mh script

Review of attachment 821709 [details] [diff] [review]:
-----------------------------------------------------------------

I'm sure we would have caught this on staging if it had turned the step to orange:
http://buildbot-master29.build.scl1.mozilla.com:8201/builders/Android%204.0%20Panda%20mozilla-central%20talos%20remote-tspaint/builds/20/steps/reboot/logs/stdio

It seems we will now be able to save 1min per job!

FYI, I would have thought that None would have been better than '', however, I'm happy to know that it also works.

/tools/buildbot/bin/python scripts/external_tools/count_and_reboot.py -f ../reboot_count.txt -n 1 -z
 in dir /builds/panda-0127/test/. (timeout 1200 secs)
 watching logfiles {}
 argv: ['/tools/buildbot/bin/python', 'scripts/external_tools/count_and_reboot.py', '-f', '../reboot_count.txt', '-n', '1', '-z']
 environment:
  HOME=/home/cltbld
  LOGNAME=cltbld
  OLDPWD=/home/cltbld
  PATH=/usr/local/bin:/usr/local/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/cltbld/bin
  PROPERTIES_FILE=/builds/panda-0127/test/buildprops.json
  PWD=/builds/panda-0127/test
  PYTHONPATH=/builds/sut_tools
  SHELL=/bin/sh
  SHLVL=4
  SUT_IP=10.12.128.110
  SUT_NAME=panda-0127
  USER=cltbld
  _=/tools/buildbot/bin/python2.7
 using PTY: False
************************************************************************************************
*********** END OF RUN - NOW DOING SCHEDULED REBOOT; FOLLOWING ERROR MESSAGE EXPECTED **********
************************************************************************************************
[sudo] password for cltbld: 
Sorry, try again.
[sudo] password for cltbld: 
Sorry, try again.
[sudo] password for cltbld: 
Sorry, try again.
sudo: 3 incorrect password attempts
Attachment #821709 - Flags: review?(armenzg) → review+
(Assignee)

Comment 10

4 years ago
Comment on attachment 821709 [details] [diff] [review]
removes the final reboot via count_and_reboot.py outside mh script

None is better than an empty string
https://hg.mozilla.org/build/buildbot-configs/rev/bbcec3c6f785
Attachment #821709 - Flags: checked-in+
in production
(Assignee)

Comment 12

4 years ago
Dustin: Our mozpool scripts return the devices to the pool once they are done with them via close_request.  They used to have a relay reboot invoked via mozharness outside of mozpool after this step,  but I removed it last week.  Am I correct in my assumption that once the device the mozpool request is closed, no further action is required like a reboot?  I'm seeing some weird behaviour on some of our masters with pandas not attaching so was wondering what your thoughts were on this.
Flags: needinfo?(dustin)
That's correct.  You certainly shouldn't be doing anything (ping, SUT, reboot, etc.) to a device after releasing it or before successfully requesting it, nor should *anything* *at* *all* be talking to the relays.  So I'm glad you removed that :)
Flags: needinfo?(dustin)
(Assignee)

Comment 14

4 years ago
Callek: So the approach I have used to address the verify reboots is to not reboot the panda in cleanupDevice in cleanup.py and then reboot it via mozpool when verify.py returns.  I don't really like this approach because it is not elegant but I don't know of another way to implement it. Do you have any suggestions.  (I'm still working through some issues with this approach)
Flags: needinfo?(bugspam.Callek)
so my thought here...

When running verify.py inside mozharness I *believe* we have access to mozpoolclient module.

So we modify http://mxr.mozilla.org/build/source/tools/lib/python/sut_lib/powermanagement.py#20 to use mozpoolclient instead of talking to the relay.

We could even do a try: catch: on the import if we really want to be reliable here.

Alternatively we could reimpliment the mozpoolclient parts we care about and call them directly without caring if we're in mozharness or not.

We call softreboot in more places than JUST cleanup.py/verify.py during a harness run...  specifically config.py and installApp.py (http://mxr.mozilla.org/build/search?string=powermanagement&find=tools&findi=&filter=^[^\0]*%24&hitlimit=&tree=build)

-- Either approach requires us to NOT do any device manipulation outside of buildbot running (e.g. if we're a panda/device-in-mozpool we don't do anything unless we're running IN a buildbot job)
Flags: needinfo?(bugspam.Callek)
(Assignee)

Updated

4 years ago
Duplicate of this bug: 919533
(Assignee)

Comment 17

4 years ago
Working on this now and there are a lot of dependencies from mozharness and mozpoolclient that need to be pulled into to accomplish this.  Callek suggested that it might be a good idea to put these all into a separate module that could be installed in the venv and accessible via powermanagement.py
(Assignee)

Comment 18

4 years ago
Created attachment 827344 [details]
Testing script with issues

This is a script I was using on the foopy to test things before I ran into roadblocks and am now reconsidering the approach in comment #15.  My intention was to incorporate this bit of code into the soft_reboot in powermanagement.py.  The issue is that verify.py and installApp.py etc are invoked via a script so they don't have any access to the mozpool handler that is currently associated with that device.  I don't know if you can create a second mozpool handler to work with a device that already has one open with it and try to reboot the device via device_power_cycle in mozpoolclient.py.  

My next thoughts would be to import verify.py and installApp.py and invoke them that way instead of the command line.  This doesn't cover all of the times that soft_reboot is called but this could be refactored.  Or serialize the state of the mozpool handler objects, write it to a file and reload from within powermanagement.py by adding a command line argument when the scripts are invoked.  I don't know.  Basically it's a problem that the scripts are invoked versus imported because we want to access the mozpool handler.
Assignee: kmoir → bugspam.Callek
Blocks: 932231

Updated

3 years ago
Component: Platform Support → Mozharness
This is now very low priority, but looks like Kim has made the most recent efforts here.

It will also likely become easier to architect once tegras are disabled.

Either way it is still a valid bug in our current use case of pandas (we shouldn't be touching relay boards directly, and should ideally remove the need for devices.json or at least for devices.json to manage port/host/etc of the power units.)
Assignee: bugspam.Callek → kmoir
(Assignee)

Comment 20

2 years ago
This can be closed since Pandas are on there way out in bug 1186615
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.