Closed Bug 1161631 Opened 10 years ago Closed 10 years ago

Autophone - implement commands to support Autophone as a service and dynamic control of devices.

Categories

(Testing Graveyard :: Autophone, defect)

defect
Not set
normal

Tracking

(firefox40 affected)

RESOLVED FIXED
Tracking Status
firefox40 --- affected

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(14 files, 1 obsolete file)

8.23 KB, text/plain
gbrown
: feedback+
Details
2.81 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
744 bytes, patch
gbrown
: review+
Details | Diff | Splinter Review
2.73 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
1.48 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
42.98 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
89.91 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
17.94 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
13.84 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
15.39 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
2.18 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
11.40 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
1.11 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
3.31 KB, patch
gbrown
: review+
Details | Diff | Splinter Review
Currently there is no clean way to stop Autophone while preserving the job queue. In order to better manage running instances of Autophone as well as support the necessary features for it to run as a service and recover from usb errors we need to implement the following commands. pause <device> - this will cause a worker to enter a paused state after completing the currently running test. pauseall - issue a pause <device> command to each device. shutdown - issue a pauseall command and when all devices are paused, shutdown autophone. restart - issue a pauseall command and when all devices are paused, restart autophone.
Summary: Autophone - implement pause, shutdown, restart commands → Autophone - implement commands to support Autophone as a service and dynamic control of devices.
Attached file autophone-commands.org
gbrown: I would appreciate your feedback.
Attachment #8606344 - Flags: review?(gbrown)
Attachment #8606344 - Flags: review?(gbrown) → feedback?(gbrown)
Are you going to keep 'stop' once 'shutdown' is available? I would prefer just one shutdown command - simple, easy to remember - but there may be a need for ungraceful shutdown. I could envision initiating a shutdown in hopes of a graceful shutdown in a timely manner, then getting frustrated by a need to wait for a lengthy job to complete. Can you support an immediate stop command during shutdown? An alternative might be a shutdown-with-timeout command: try to shutdown gracefully, but force things down if it is taking too long; I have concerns about this idea and am not sure I like it, just thought I'd mention it. Personally, I would not use 'restart' -- I would not trust it. I would prefer to shutdown autophone and start it again: Then I *know* that there's a new process that can't possibly be holding old configuration and starts in a state that I can reproduce. I suppose the downside of that is that I would need to wait around for shutdown and then restart manually, but I think that's okay. What happens if a device is added, reconfigured, etc, but the updated configuration is invalid (cannot be parsed, or is illogical, etc)? Does autophone currently treat bad configuration as a fatal error?
Comment on attachment 8606344 [details] autophone-commands.org Good ideas here, described well!
Attachment #8606344 - Flags: feedback?(gbrown) → feedback+
(In reply to Geoff Brown [:gbrown] from comment #3) > Are you going to keep 'stop' once 'shutdown' is available? I would prefer > just one shutdown command - simple, easy to remember - but there may be a > need for ungraceful shutdown. > Yes, I think it is a good back up to the shutdown in case it is needed. > I could envision initiating a shutdown in hopes of a graceful shutdown in a > timely manner, then getting frustrated by a need to wait for a lengthy job > to complete. Can you support an immediate stop command during shutdown? > Yes. > An alternative might be a shutdown-with-timeout command: try to shutdown > gracefully, but force things down if it is taking too long; I have concerns > about this idea and am not sure I like it, just thought I'd mention it. > Yeah, I'm not sure about this either. For some of the long running tests, for example mochitest-*, the test may take a very long time to complete. > Personally, I would not use 'restart' -- I would not trust it. I would > prefer to shutdown autophone and start it again: Then I *know* that there's > a new process that can't possibly be holding old configuration and starts in > a state that I can reproduce. I suppose the downside of that is that I would > need to wait around for shutdown and then restart manually, but I think > that's okay. > I guess I was thinking of exposing this to the service but I can implement restart in the service by stop, start. > What happens if a device is added, reconfigured, etc, but the updated > configuration is invalid (cannot be parsed, or is illogical, etc)? Does > autophone currently treat bad configuration as a fatal error? There isn't anything that enforces the manifest format. In many places, if sections or options aren't available default values will be used instead. In the unit test's runtestsremote.py it will raise an exception which will end up showing as an exception in Treeherder but it won't kill the worker or the overall autophone process. I'll see what I can think of for this in this particular case as well as in general.
gbrown: I have a number of small bug fixes I'd like to bundle with this bug. I'll attach them as patches first so you can review them quickly. The remainder of the patches will be much larger unfortunately.
When I realized that I could use the arrow keys in Treeherder to very quickly navigate between jobs in order to cancel them, I found that if all of the jobs for a build were canceled and a new job was immediately available, it was possible for the PhoneTest.test_result to carry over the PhoneTestResult.USERCANCEL value which would cause the next job to be skipped and its status not updated on Treeherder.
Attachment #8612905 - Flags: review?(gbrown)
If the check_battery call times out, the test's setup_job method won't be called but the teardown_job will. This makes sure the setup is called before the battery check.
Attachment #8612906 - Flags: review?(gbrown)
In worker.py, we end up initializing ADBDevice for the PhoneWorkerSubProcess instance and for each PhoneTest instance used by the worker. This is redundant and causes many extra ADBDevice initializations. This patch forwards PhoneTest's dm property to the PhoneWorkerSubProcess's. There is a similar problem in autophone.py. I'll deal with it in a later patch.
Attachment #8612909 - Flags: review?(gbrown)
I found that it was possible for the PhoneTest.dm access to timeout which prevented the PhoneTest from completing setup_job which left stop_time as None when tearing down the job. This problem has probably been resolved by the new patch which uses the PhoneWorkerSubProcess' dm, but during development and before ordering the patches, this patch fixed the symptom by moving the dm accesses to the end of the setup_job method.
Attachment #8612913 - Flags: review?(gbrown)
Attachment #8612905 - Flags: review?(gbrown) → review+
Attachment #8612906 - Flags: review?(gbrown) → review+
Comment on attachment 8612909 [details] [diff] [review] bug-1161631-patch-3-v1.patch Review of attachment 8612909 [details] [diff] [review]: ----------------------------------------------------------------- Nice. Much simpler.
Attachment #8612909 - Flags: review?(gbrown) → review+
Attachment #8612913 - Flags: review?(gbrown) → review+
Attached patch bug-1161631-patch-5-v1.patch (obsolete) — Splinter Review
Changes to support autophone-shutdown. This patch exposes and issue with the logging, in particular logging from external libraries, which can cause hangs. Logging is extensively reworked in the next patch. Still need to finish implementing the remainder of the commands.
Attachment #8613068 - Flags: review?(gbrown)
One line change to actually use an RLock as mentioned in the commit message.
Attachment #8613068 - Attachment is obsolete: true
Attachment #8613068 - Flags: review?(gbrown)
Attachment #8613301 - Flags: review?(gbrown)
This reorganizes the logging pretty substantially. For the most part is consists of removal of the multiprocessing logging handlers and changes to using a file global logger instance followed by changes s/self.logger/logger/. The other part of this was an attempt to resolve hangs I was seeing. It turns out that this may not be necessary and is definitely not sufficient to eliminate the hang but I've come to like this approach better. It iterates over the existing logger instances in the main process and the worker processes and initializes their handlers and adds a filter which checks for the existence of sensitive data in the log message and prevents the message from being logged if it does contain sensitive data. Another change is the movement of some of the PhoneWorkerSubProcess initialization off of the main process and onto the worker's process in the run method.
Attachment #8613411 - Flags: review?(gbrown)
This patch introduces a shared process lock which is used to serialize the pulsemonitor handlers and the treeherder post which appears to be the cause of the hangs I have been seeing. This is to be applied after the patch in bug 1160188 comment 3
Attachment #8613415 - Flags: review?(gbrown)
This patch implements help and device commands. The device-stop may be sufficient for our restart with new configuration, but I haven't tested that use case where the worker is started, the configuration changed, then the worker stopped and started.
Attachment #8613417 - Flags: review?(gbrown)
(In reply to Bob Clary [:bc:] from comment #14) > Created attachment 8613411 [details] [diff] [review] > bug-1161631-patch-6-v1.patch I wanted to let you know that I tried putting that handler/filter code into an external function in utils.py, but I couldn't get it to work. Either I was doing it wrong or the import was messing me up. Just an fyi in case you suggest it in the review.
(In reply to Bob Clary [:bc:] from comment #16) > Created attachment 8613417 [details] [diff] [review] > bug-1161631-patch-8-v1.patch > > This patch implements help and device commands. The device-stop may be > sufficient for our restart with new configuration, but I haven't tested that > use case where the worker is started, the configuration changed, then the > worker stopped and started. gbrown: FYI, This was a first stab at getting the dispatching from AutoPhone to PhoneWorker to PhoneWorkerSubProcess working. I plan on modifyging the device-stop command to to perform two functions: 1. Will perform like device-shutdown and remove the device after it stops. 2. Will perform as is now and will be called device-restart. It will effectively call PhoneWorker.stop() which will kill the PhoneWorkerSubProcess and the check_for_dead_workers will resurrect it.
Comment on attachment 8613301 [details] [diff] [review] bug-1161631-patch-5-v2.patch Review of attachment 8613301 [details] [diff] [review]: ----------------------------------------------------------------- ::: process_states.py @@ +7,5 @@ > + STARTING = 'STARTING' > + RUNNING = 'RUNNING' > + SHUTTINGDOWN = 'SHUTTINGDOWN' > + SHUTDOWN = 'SHUTDOWN' > + STOPPING = 'STOPPING' This might be a good place for some one-line comments describing the states. It's mostly obvious, but you could imagine confusion over eg SHUTTINGDOWN vs STOPPING.
Attachment #8613301 - Flags: review?(gbrown) → review+
Comment on attachment 8613411 [details] [diff] [review] bug-1161631-patch-6-v1.patch Review of attachment 8613411 [details] [diff] [review]: ----------------------------------------------------------------- Nice. Looking good!
Attachment #8613411 - Flags: review?(gbrown) → review+
Attachment #8613415 - Flags: review?(gbrown) → review+
Comment on attachment 8613417 [details] [diff] [review] bug-1161631-patch-8-v1.patch Review of attachment 8613417 [details] [diff] [review]: ----------------------------------------------------------------- ::: autophone.py @@ +236,5 @@ > + if worker.crashes.too_many_crashes(): > + initial_state = PhoneStatus.DISABLED > + msg_subj += ' and was disabled' > + msg_body += 'It looks really crashy, so I disabled it. ' \ > + 'Sorry about that.\n' Cute! @@ +440,5 @@ > +autophone-status > + Generate a status report for each device. > + > +autophone-shutdown > + Shutdown each worker after it's current test, then s/it's/its/ @@ +452,5 @@ > + Check if the device's worker process is alive, report to log. > + > +device-stop <device> > + Immediately stop the device's worker process. The worker will be > + automatically restarted. restarted immediately, or when needed? ::: worker.py @@ +311,4 @@ > del self.p > self.phone_status = phone_status > self.p = multiprocessing.Process(target=self.run, name=self.phone.id) > + #self.p.daemon = True What's happening here?
Attachment #8613417 - Flags: review?(gbrown) → review+
(In reply to Geoff Brown [:gbrown] from comment #21) > Comment on attachment 8613417 [details] [diff] [review] > bug-1161631-patch-8-v1.patch > > Cute! I think Mark did that ages ago. > > > @@ +452,5 @@ > > + Check if the device's worker process is alive, report to log. > > + > > +device-stop <device> > > + Immediately stop the device's worker process. The worker will be > > + automatically restarted. > > restarted immediately, or when needed? It will be restarted as part of the check for dead workers during the message loop processing. > > ::: worker.py > @@ +311,4 @@ > > del self.p > > self.phone_status = phone_status > > self.p = multiprocessing.Process(target=self.run, name=self.phone.id) > > + #self.p.daemon = True > > What's happening here? Ah. Left that comment in, but I removed the daemon setting on the worker since multiprocessing docs state that the child process should not create any new processes if it is a daemon. Since the worker will create multiple child processes for the adb calls, I removed it.
part 4 -implement autophone-restart, autophone-add-device, device-restart commands, make device-stop remove device. Note the timeout=60 on ADBDevice is removed in a later patch. I meant it to only be used initially in AutoPhone but when I consolidated the ADBDevice instances it caused problems in the PhoneWorkerSubProcess. Note that restarting a device will pick up changes in the test manifest and test config files but not in the overall autophone configuration file if you are using that. To change the autophone configuration, you need to restart autophone. note that errors in the test manifests will cause Autophone to exit.
Attachment #8615807 - Flags: review?(gbrown)
add comments to process_states, eliminate an unused state and set the Phone worker initial state to STARTING.
Attachment #8615826 - Flags: review?(gbrown)
part 6 - reuse ADBDevice instances. As I mentioned earlier with regard to the ADBDevice instances in the phone tests, this eliminates the duplicate creation of ADBDevice instances in the main Autophone process and shares a single instance amongst all tests for a given device. This also removes that timeout=60 I mentioned earlier.
Attachment #8615829 - Flags: review?(gbrown)
part 2 - part duh - do not attempt to get treeherder credentials if treeherder is not being used.
Attachment #8615831 - Flags: review?(gbrown)
part 7 - document commands and add script ap.sh to control Autophone
Attachment #8615832 - Flags: review?(gbrown)
Overall I am quite happy with the flexibility this gives to managing a running instance of Autophone. The only downside is with autophone-restart when if you change the test manifest to remove tests for a device which are pending and already in the jobs.sqlite database, then they will be quietly dropped after autophone restarts and left in the pending state in treeherder. I can live with it.
fyi, I realized an easy way to cancel tests that have been queued but removed from the test manifest. I'll do a follow up patch.
Actually no it isn't easy. I totally forgot what I knew last night, that without the test objects I can't submit to treeherder. So, not going to happen.
Comment on attachment 8615807 [details] [diff] [review] bug-1161631-patch-9-v1.patch Review of attachment 8615807 [details] [diff] [review]: ----------------------------------------------------------------- ::: autophone.py @@ +350,5 @@ > + # is only run during a program reload. > + from fcntl import fcntl, F_GETFD, F_SETFD, FD_CLOEXEC > + for fd in xrange(0x3, 0x10000): > + try: > + fcntl(fd, F_SETFD, fcntl(fd, F_GETFD) | FD_CLOEXEC) Cool. (And a little scary.)
Attachment #8615807 - Flags: review?(gbrown) → review+
Comment on attachment 8615826 [details] [diff] [review] bug-1161631-patch-10-v1.patch Review of attachment 8615826 [details] [diff] [review]: ----------------------------------------------------------------- ::: process_states.py @@ +8,5 @@ > + of Autophone in response to received commands. > + > + ProcessStates are used in AutoPhone, PhoneWorker and > + PhoneWorkerSubProcess. When an AutoPhone, PhoneWorker or > + PhoneWorkerSubProcess object is initialized, the objects `state` s/objects/object's/ @@ +28,5 @@ > + > + STOPPING is used for AutoPhone and PhoneWorker when they are in > + the process of stopping immediately with out waiting for the > + currently running tests to complete. > + """ This whole block is clear, concise, and helpful. Nice.
Attachment #8615826 - Flags: review?(gbrown) → review+
Comment on attachment 8615829 [details] [diff] [review] bug-1161631-patch-11-v1.patch Review of attachment 8615829 [details] [diff] [review]: ----------------------------------------------------------------- That feels much cleaner. ::: worker.py @@ -283,3 @@ > self.phone_status = None > self.filehandler = None > - self.dm = None chuckle
Attachment #8615829 - Flags: review?(gbrown) → review+
Attachment #8615831 - Flags: review?(gbrown) → review+
Comment on attachment 8615832 [details] [diff] [review] bug-1161631-patch-13-v1.patch Review of attachment 8615832 [details] [diff] [review]: ----------------------------------------------------------------- ::: USAGE.md @@ +710,5 @@ > +## Controlling Autophone > + > +Autophone listens by default on port 28001 for commands. You can send > +commands to Autophone using the script ap.sh. You will need the > +program ncat installed. ncat or netcat or nc? On my ubuntu, netcat and nc point to the same binary. There is no ncat and the system suggests I install nmap to get it...which seems unlikely. @@ +727,5 @@ > + autophone-help > + Generate this message. > + > + autophone-add-device <name> <serialno> > + There's an 'extra' newline here. @@ +745,5 @@ > + autophone-stop > + Immediately stop autophone and all worker processes; may be > + delayed by pending download. > + > + device-debug <device> <level> I don't want to clutter the description, but maybe it would be helpful to say what <device> is and what the possible <level> values are? @@ +778,5 @@ > + device's worker process will not be restarted and will be removed > + from the active list of workers. > + > + device-stop <device> > + Another newline.
Attachment #8615832 - Flags: review?(gbrown) → review+
(In reply to Geoff Brown [:gbrown] from comment #31) > Comment on attachment 8615807 [details] [diff] [review] > bug-1161631-patch-9-v1.patch > > Review of attachment 8615807 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: autophone.py > Cool. (And a little scary.) Yeah, I did this for Bughunter years ago and it works pretty well. (In reply to Geoff Brown [:gbrown] from comment #34) > Comment on attachment 8615832 [details] [diff] [review] > bug-1161631-patch-13-v1.patch > > + > > +Autophone listens by default on port 28001 for commands. You can send > > +commands to Autophone using the script ap.sh. You will need the > > +program ncat installed. > > ncat or netcat or nc? On my ubuntu, netcat and nc point to the same binary. > There is no ncat and the system suggests I install nmap to get it...which > seems unlikely. On fedora they are the same too, but is installed via nmap-ncat. The man page says it is part of nmap though. I'll say: Autophone listens by default on port 28001 for commands. You can send commands to Autophone using the script ap.sh. You will need the program ncat/nc installed. On Fedora ncat can be installed via `sudo yum install nmap-ncat` and on Ubuntu it can be installed via `sudo apt install nmap`. > > @@ +727,5 @@ > > + autophone-help > > + Generate this message. > > + > > + autophone-add-device <name> <serialno> > > + > > There's an 'extra' newline here. > > @@ +745,5 @@ > > + autophone-stop > > + Immediately stop autophone and all worker processes; may be > > + delayed by pending download. > > + > > + device-debug <device> <level> > > I don't want to clutter the description, but maybe it would be helpful to > say what <device> is and what the possible <level> values are? > sure. I'll change it to <devicename> everywhere and include an explanation under autophone-add-device. Actually the debug stuff is left over from devicemanager days and doesn't really have the intended effect. I'm just going to kill it. > @@ +778,5 @@ > > + device's worker process will not be restarted and will be removed > > + from the active list of workers. > > + > > + device-stop <device> > > + > > Another newline. I see the newlines in the patch but not the file. hmm.
Blocks: 1172329
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: