Bug 795195 (Closed)
Opened 13 years ago • Closed 13 years ago
Workers don't stop crashing if ping causes crash
Categories: (Testing Graveyard :: Autophone, defect)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: mcote, Unassigned)
Attachments (2 files, 2 obsolete files)
patch | 23.45 KB | bc: review+ | Details | Diff | Splinter Review
patch | 2.60 KB | bc: review+ | Details | Diff | Splinter Review
The downside of the patch to continue pinging disabled devices is that the pings come entirely from the worker process, so if the worker dies and is restarted, it forgets the time of the last ping and issues another one; if that ping triggers a crash, a vicious cycle begins. We need to persist the last ping time and/or cap the number of crashes allowed within a certain window before the device is permanently disabled. We'd need another state for that, but that might be a good idea anyway: DISABLED and DISCONNECTED, maybe.
Comment 1 (Assignee) • 13 years ago
This patch is not fully tested. I'm running it on my staging server here (2 phones), but I've only verified a couple of the new or modified code paths. I figured that, while it's baking, you can take a look as well to see if there are any obvious problems and/or to do some testing yourself. We should adopt the unit-test pattern from mozdevice to script some scenarios to test more of the failure paths.
This is, however, a stop-gap patch on the way towards extracting the device pool from autophone. I want to use (and thus help verify) the new DeviceManager-exceptions patch and introduce a new state that I think we'll need in the new pool module. As I am going to continue extracting the pool code, this patch probably won't be running for too long, so we don't need to be terribly thorough at the moment.
Also added a little convenience bit to the mailer so I can better filter my error emails.
Updated (Assignee) • 13 years ago
OS: Linux → All
Hardware: x86 → All
Comment 2 • 13 years ago
Comment on attachment 670685 [details] [diff] [review]
Add DISCONNECTED state, update for DM-exceptions change
Just a couple of comments. Still looking at the rest.
>diff --git a/worker.py b/worker.py
>...
>+class Crashes(object):
>+
>+ CRASH_WINDOW = datetime.timedelta(seconds=30)
>+ MAX_CRASHES = 5
>+
>+ def __init__(self):
>+ self.crash_times = []
>+
>+ def add_crash(self):
>+ self.crash_times.append(datetime.datetime.now())
>+ self.crash_times = [x for x in self.crash_times
>+ if self.crash_times[-1] - x <= self.CRASH_WINDOW]
>+
This only keeps the "crashes" that have occurred
in the last 30 seconds? Should be 30 minutes?
...
> def check_sdcard(self):
...
>- success = self.dm.pushFile(tmp.name, posixpath.join(d,
>- 'testfile'))
>+ success = True
>+ try:
>+ self.dm.removeDir(d)
>+ self.dm.mkDir(d)
>+ if self.dm.dirExists(d):
>+ with tempfile.NamedTemporaryFile() as tmp:
>+ tmp.write('autophone test\n')
>+ tmp.flush()
>+ self.dm.pushFile(tmp.name, posixpath.join(d, 'testfile'))
testfile seems so generic a name. I know it is an existing thing, but
perhaps a more descriptive name like sdcard_check or something?
>
Comment 3 • 13 years ago
Confirmed that CRASH_WINDOW of 30 seconds won't work to disable a phone but 30 minutes will.
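For reference, the sliding-window pruning from the patch can be exercised standalone. This sketch uses the 30-minute window confirmed above; `too_many_crashes()` is a hypothetical helper name for the disable check, not code from the patch:

```python
import datetime

class Crashes(object):
    """Sliding-window crash counter, modeled on attachment 670685 but with
    the 30-minute window confirmed to work in practice. too_many_crashes()
    is an illustrative helper, not part of the patch."""

    CRASH_WINDOW = datetime.timedelta(minutes=30)
    MAX_CRASHES = 5

    def __init__(self):
        self.crash_times = []

    def add_crash(self):
        now = datetime.datetime.now()
        self.crash_times.append(now)
        # Keep only crashes that happened within CRASH_WINDOW of the latest.
        self.crash_times = [x for x in self.crash_times
                            if now - x <= self.CRASH_WINDOW]

    def too_many_crashes(self):
        return len(self.crash_times) >= self.MAX_CRASHES
```

With a 30-second window, five worker crashes rarely land close enough together to trip the threshold, which is why only the 30-minute window actually disables a phone.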
Comment 4 • 13 years ago
When starting up with a phone in the cache but the phone is turned off I get:
2012-10-12T12:13:25 Worker starting up.
Could not connect; sleeping for 5 seconds.
Could not connect; sleeping for 10 seconds.
Could not connect; sleeping for 15 seconds.
Could not connect; sleeping for 20 seconds.
Could not connect; sleeping for 25 seconds.
Could not connect; sleeping for 30 seconds.
Could not connect; sleeping for 35 seconds.
Process Process-2:
Traceback (most recent call last):
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/work/mozilla/virtualenvwrapper/autophone/autophone/worker.py", line 346, in loop
if not self.check_sdcard():
File "/work/mozilla/virtualenvwrapper/autophone/autophone/worker.py", line 191, in check_sdcard
dev_root = self.dm.getDeviceRoot()
File "/work/mozilla/virtualenvwrapper/autophone/autophone/worker.py", line 151, in dm
retrylimit=8)
File "/work/mozilla/virtualenvwrapper/autophone/mozbase/mozdevice/mozdevice/devicemanagerSUT.py", line 38, in __init__
self.getDeviceRoot()
File "/work/mozilla/virtualenvwrapper/autophone/mozbase/mozdevice/mozdevice/devicemanagerSUT.py", line 727, in getDeviceRoot
data = self._runCmds([{ 'cmd': 'testroot' }])
File "/work/mozilla/virtualenvwrapper/autophone/mozbase/mozdevice/mozdevice/devicemanagerSUT.py", line 140, in _runCmds
self._sendCmds(cmdlist, outputfile, timeout)
File "/work/mozilla/virtualenvwrapper/autophone/mozbase/mozdevice/mozdevice/devicemanagerSUT.py", line 132, in _sendCmds
raise DMError("Remote Device Error: unable to connect to %s after %s attempts" % (self.host, self.retrylimit))
DMError: Remote Device Error: unable to connect to 192.168.1.25 after 8 attempts
I didn't see the 'Exception while checking SD card!' message, though.
Comment 5 (Assignee) • 13 years ago
Fixed some exception handling. All DeviceManager calls in worker.py should be caught now. As mentioned in IRC, the crash handling in AutoPhone should be a last resort; DeviceManager errors should be caught and result in the worker being marked DISCONNECTED.
Also some refactoring so that a failed job gets retried later instead of being discarded. Each job should probably have a maximum retry count too, but this is better than before, when jobs were abandoned after random device or network errors.
Attachment #670685 - Attachment is obsolete: true
Attachment #670685 - Flags: review?(bclary)
Attachment #671159 - Flags: review?(bclary)
Comment 6 (Assignee) • 13 years ago
Updated with better filename for check_sdcard() test file as suggested in comment 2.
To expand a bit on my last comment and to recap some IRC discussion, the DISABLED state, in which AutoPhone does *not* attempt to interact with the remote device at all, is now used only in these two cases:
- user request, via the 'disable' command
- the worker subprocess has crashed more than 5 times in 30 seconds.
The latter is done to avoid really vicious crash loops, e.g. if PhoneWorkerSubProcess.check_sdcard(), which is always executed when a non-DISABLED worker starts, causes a crash.
Generally, a device error (DeviceManager.DMError exception raised) should cause the worker to go into the DISCONNECTED state, in which no jobs are performed but the phone is still pinged every 15 minutes. If one of those pings succeeds, the worker attempts to reboot the phone and reconnect; if that is successful, it proceeds with any queued jobs, which now includes the one being performed when a DMError was raised.
Admittedly, never dropping a job might be problematic if there's a bad build, but that is hard to determine right now because of a problem I filed as bug 800236: there is no easy way to tell whether a DMError was the result of a device/network error (e.g. connection dropped) or a command failure (e.g. trying to overwrite a directory with a file). Ideally we'd have different behaviour here, at least in some cases.
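The DISABLED/DISCONNECTED rules described above can be summarized in a small sketch. The state names come from this bug; `next_state()`, its event strings, and the threshold check are illustrative, not code from the patch:

```python
# Illustrative sketch of the worker state transitions described in this bug.
# DISABLED and DISCONNECTED are the states from the patch; next_state() is a
# hypothetical helper used only to make the rules concrete.
DISABLED = 'DISABLED'          # no device interaction at all
DISCONNECTED = 'DISCONNECTED'  # no jobs, but pinged every 15 minutes
OK = 'OK'

MAX_CRASHES = 5  # worker-subprocess crashes allowed per window

def next_state(event, crash_count=0, user_disabled=False):
    """Map an event to the worker state per the rules above."""
    if user_disabled:
        return DISABLED                  # explicit 'disable' command
    if event == 'subprocess_crash' and crash_count > MAX_CRASHES:
        return DISABLED                  # avoid vicious crash loops
    if event == 'dm_error':
        return DISCONNECTED              # device error: keep pinging
    if event == 'ping_ok_reboot_ok':
        return OK                        # reconnected; resume queued jobs
    return DISCONNECTED
```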
Attachment #671159 - Attachment is obsolete: true
Attachment #671159 - Flags: review?(bclary)
Attachment #671175 - Flags: review?(bclary)
Comment 7 • 13 years ago
Comment on attachment 671175 [details] [diff] [review]
Add DISCONNECTED state, update for DM-exceptions change
>diff --git a/autophone.py b/autophone.py
[snip]
>diff --git a/mailer.py b/mailer.py
[snip]
>diff --git a/phonetest.py b/phonetest.py
[snip]
>diff --git a/worker.py b/worker.py
[snip]
> def recover_phone(self):
> reboots = 0
>- while not self.disabled:
>- if reboots < self.MAX_REBOOT_ATTEMPTS:
>- logging.info('Rebooting phone...')
>- reboots += 1
>- success = self.dm.reboot(self.ipaddr, 30000+self.worker_num)
>- if success:
>+ while reboots < self.MAX_REBOOT_ATTEMPTS:
>+ logging.info('Rebooting phone...')
>+ reboots += 1
>+ # FIXME: reboot() no longer indicates success/failure; instead
>+ # we just verify the device root.
>+ try:
>+ self.dm.reboot(self.ipaddr, 30000+self.worker_num)
>+ if self.dm.getDeviceRoot():
> logging.info('Phone is back up.')
> if self.check_sdcard():
>- self.disabled = False
> return
> logging.info('Failed SD card check.')
> else:
> logging.info('Phone did not reboot successfully.')
>- # DM can be in a weird state if reboot failed.
>- self.disconnect_dm()
>- else:
>- logging.info('Phone has been rebooted %d times; giving up.' %
>- reboots)
>- self.disable_phone('Phone was rebooted %d times.' % reboots)
>+ except DMError:
>+ logging.error('Exception while checking SD card!')
>+ logging.error(traceback.format_exc())
>+ # DM can be in a weird state if reboot failed.
>+ self.disconnect_dm()
>+
>+ logging.info('Phone has been rebooted %d times; giving up.' %
>+ reboots)
>+ self.phone_disconnected('Phone was rebooted %d times.' %
>+ reboots)
>
Maybe send the exception traceback along with the email message?
See below.
[snip]
>+ def do_job(self, job):
>+ if not self.has_error():
>+ logging.info('Rebooting...')
>+ self.reboot()
>+
>+ # may have gotten an error trying to reboot, so test again
>+ if self.has_error():
>+ logging.info('Phone is in error state; queuing job '
>+ 'for later.')
>+ # FIXME: Write these to a file to resume eventually.
>+ # We'll need a way (command, cli option) to automatically
>+ # clear these as well.
>+ return False
>+
>+ self.status_update(phonetest.PhoneTestMessage(
>+ self.phone_cfg['phoneid'],
>+ phonetest.PhoneTestMessage.INSTALLING,
>+ job['blddate']))
>+ logging.info('Installing build %s.' %
>+ datetime.datetime.fromtimestamp(float(job['blddate'])))
>+
>+ try:
>+ pathOnDevice = posixpath.join(self.dm.getDeviceRoot(),
>+ 'build.apk')
>+ self.dm.pushFile(os.path.join(job['cache_build_dir'],
>+ 'build.apk'), pathOnDevice)
>+ self.dm.installApp(pathOnDevice)
>+ self.dm.removeFile(pathOnDevice)
>+ except DMError:
>+ logging.error('Exception installing fennec!')
>+ logging.error(traceback.format_exc())
>+ self.phone_disconnected('Exception installing fennec.')
I ran into a case where my device did not have sufficient space. I think it would
have been nice to include the actual exception traceback in the email message
rather than having to grep the log. What do you think?
>+ return False
>+ self.current_build = job['blddate']
>+
>+ logging.info('Running tests...')
>+ for t in self.tests:
>+ if self.has_error():
>+ break
>+ t.current_build = job['blddate']
>+ # TODO: Attempt to see if pausing between jobs helps with
>+ # our reconnection issues
>+ time.sleep(30)
>+ try:
>+ t.runjob(job)
>+ except DMError:
>+ logging.error('Uncaught device error while running '
>+ 'test!')
>+ logging.error(traceback.format_exc())
>+ # FIXME: We should retry the whole thing; as it is, we
>+ # actually skip this job.
>+ self.phone_disconnected('Exception installing fennec.')
Exception running test, not installing fennec. Ditto on including the exception, if you want.
Attachment #671175 - Flags: review?(bclary) → review+
Comment 8 (Assignee) • 13 years ago
That's a great suggestion. I have been putting it off (it's actually part of bug 791463), but this is a good opportunity. Here's a patch on top of the last.
Attachment #672307 - Flags: review?(bclary)
Comment 9 • 13 years ago
(In reply to Mark Côté ( :mcote ) from comment #8)
> Created attachment 672307 [details] [diff] [review]
> Include exception details in emails
This looks good and I'm testing it atm. I'll r+ as soon as I confirm the email messages.
We also need to catch exceptions in either Autophone.get_build or BuildCache.get in the event of a network error, which can otherwise terminate the CmdTCPHandler handler, unless we also make sure to catch the exception in Autophone.route_cmd. I'm not sure how the data still left in the pipe, to be consumed after an exception in the CmdTCPHandler handler, affects later commands.
Comment 10 (Assignee) • 13 years ago
Good point. I think we should catch the exception in BuildCache and return None; that will handle both network-down and build-not-found errors. AutoPhone.get_build() should handle None results. For now I think it's okay to just abort the job. My next big batch of changes already has a mechanism to deal with this, in that AutoPhone keeps a central job list, with some history, so retrying a job after a failed attempt to get a build will be easy.
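A rough sketch of that None-propagation idea (BuildCache and get_build names come from this bug; the IOError handling, `_download()` helper, and overall structure are assumptions for illustration, not the eventual fix):

```python
class BuildCache(object):
    """Sketch: catch the download error and signal failure with None.
    _download() stands in for the real fetch logic."""

    def get(self, buildurl):
        try:
            return self._download(buildurl)
        except IOError:
            # Network down or build not found: caller sees None.
            return None

    def _download(self, buildurl):
        # A real implementation would fetch the build over HTTP.
        raise NotImplementedError

def get_build(cache, buildurl):
    """AutoPhone.get_build()-style caller that aborts the job on None."""
    build = cache.get(buildurl)
    if build is None:
        return None  # abort this job for now; retry via a future job list
    return build
```

This keeps the exception from propagating up into the CmdTCPHandler handler, since the failure is converted to a value the caller can check.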
Would you like to handle those changes (catch exception, abort job on None)?
Comment 11 • 13 years ago
(In reply to Mark Côté ( :mcote ) from comment #10)
> Would you like to handle those changes (catch exception, abort job on None)?
Sure.
Updated • 13 years ago
Attachment #672307 - Flags: review?(bclary) → review+
Comment 12 (Assignee) • 13 years ago
https://github.com/mozilla/autophone/commit/2b6d54ec0b92d5fc9ae9a49c8ada048018632e8e
This is all we need for this bug; you can file a separate bug for handling errors getting builds.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 13 • 13 years ago
Filed bug 803169.
Updated • 4 years ago
Product: Testing → Testing Graveyard