Closed Bug 853961 Opened 11 years ago Closed 11 years ago

Revamp Autophone logging

Categories

Testing Graveyard :: Autophone (defect)
Priority: Not set
Severity: normal

Tracking

(Not tracked)
Status: RESOLVED FIXED
People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(13 files, 5 obsolete files)

500 bytes, patch (mcote: review+)
1.36 KB, patch (mcote: review+)
2.10 KB, patch (mcote: review+)
532 bytes, patch (mcote: review+)
36.15 KB, application/zip
65.25 KB, patch (mcote: feedback+)
1.04 KB, patch (mcote: review+)
332 bytes, patch (mcote: review+)
629 bytes, patch (mcote: review+)
2.87 KB, patch
9.97 KB, patch
3.94 KB, patch (mcote: review+)
3.97 KB, patch (mcote: review+)
I've been looking into the logging in Autophone. Some of the ideas I've been considering:

* At a minimum we should use TimedRotatingFileHandlers to rotate our logs every day, to keep them from growing without bound (see the first sketch after this list).

* By passing exc_info=1 to logging calls, or by using logging.exception, we can automatically include exception information in the message and clean up some of our error-handling code.

* We can use StreamHandlers to customize which log messages are sent to the console and eliminate almost all uses of print. For example, we could say that only ERROR log messages are printed to the console while all messages are written to the file log.

* It is possible to use SMTPHandlers to send specific log messages via email if we override the emit method to use smtplib.SMTP_SSL rather than smtplib.SMTP with starttls(), which doesn't work with mail.mozilla.com (see the second sketch below).

* In individual tests, we often log messages that do not contain any additional metadata about what the test is doing. For example, in s1s2 we might say 'running fennec' but not say which build is being tested, whether caching is enabled, etc. It should be possible to use special formatters for our tests so that we can easily include all of the relevant metadata.

* We sometimes use print statements (not often) and also call into mozbase.mozdevice.deviceManagerSUT, which does use print. As a result we capture stdout for the workers and report it to .out text files. I really hate this, partly due to the additional files but also because of the loss of any timing/sequence information between the messages in the log file and the .out file. I suppose we could capture stdout from dm calls and log them with the other worker log messages, but I'm open to suggestions.

* Currently the workers post status messages back to the Autophone process using a multiprocessing.Queue; these are then logged into autophone's main log file. It would be possible, I think, to use a log server in the autophone process and a SocketHandler in the worker processes so that log messages are routed back to the Autophone process and reported there. Right now each worker logs its results to a separate log file. Losing that granularity is an issue, but it might be possible to use a different handler in Autophone to report worker log messages to a separate file. *This is the most problematic idea and the one that I can't justify.*
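
A minimal sketch of the first three ideas; the 'autophone' logger name, the log file name, and the ERROR-only console threshold are illustrative, not a final configuration:

    import logging
    import logging.handlers

    logger = logging.getLogger('autophone')
    logger.setLevel(logging.DEBUG)

    # Rotate the file log at midnight, keeping a week of backups.
    filehandler = logging.handlers.TimedRotatingFileHandler(
        'autophone.log', when='midnight', backupCount=7)
    filehandler.setFormatter(
        logging.Formatter('%(asctime)s|%(levelname)s|%(message)s'))
    logger.addHandler(filehandler)

    # Only ERROR and above reach the console; everything reaches the file.
    consolehandler = logging.StreamHandler()
    consolehandler.setLevel(logging.ERROR)
    logger.addHandler(consolehandler)

    try:
        raise RuntimeError('simulated failure')
    except RuntimeError:
        # exception() is equivalent to error(..., exc_info=1): the
        # traceback is appended to the message automatically.
        logger.exception('running fennec failed')

And a sketch of the SMTP_SSL idea; SSLSMTPHandler is a hypothetical name, and the message assembly is simplified relative to the stdlib's emit():

    import logging.handlers
    import smtplib

    class SSLSMTPHandler(logging.handlers.SMTPHandler):
        def emit(self, record):
            try:
                # Use an SSL connection instead of SMTP + starttls().
                port = self.mailport or smtplib.SMTP_SSL_PORT
                smtp = smtplib.SMTP_SSL(self.mailhost, port)
                body = ('From: %s\r\nTo: %s\r\nSubject: %s\r\n\r\n%s' %
                        (self.fromaddr, ','.join(self.toaddrs),
                         self.getSubject(record), self.format(record)))
                if self.username:
                    smtp.login(self.username, self.password)
                smtp.sendmail(self.fromaddr, self.toaddrs, body)
                smtp.quit()
            except Exception:
                self.handleError(record)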

Mark, Jeff: what do you think?
I am not super-knowledgeable about the problem space, but what you outlined above seems well thought out and a great place to start.  I would like to fold lessons learned back into mozlog, but it is a bit of a stub class so I wouldn't want to impede progress with it either.

WRT logging the print statements of device manager...
* Personally, when using it as an API, I don't generally think DM should print anything, at least as a blanket solution.  Maybe you pass it callbacks for whatever levels make sense internal to DM.  That said, I don't know the code as well as others, nor do I use it as much as others, so it's at best a back-seat opinion.
* It is possible to capture stdout/stderr for logging purposes.  In fact, I would say that in reality you do want to capture stdout/stderr for logging purposes.  Doing this is easy (monkey-patch e.g. sys.stdout; see the sketch below), but making sure everything continues to work is hard: cleanup, what if someone else is patching them, etc.  I've found this an annoying problem that I've "solved" with continuous plugging of leaks.  That said, I'd love to see a good solid solution to the problem.
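
One common shape for that monkey patch, as a sketch (the logger name is hypothetical, and the cleanup/double-patching caveats above still apply):

    import logging
    import sys

    class LoggingStream(object):
        """File-like object that forwards writes to a logger."""
        def __init__(self, logger, level=logging.INFO):
            self.logger = logger
            self.level = level

        def write(self, buf):
            for line in buf.rstrip().splitlines():
                self.logger.log(self.level, line.rstrip())

        def flush(self):
            pass  # the logging handlers do their own flushing

    logging.basicConfig(stream=sys.stderr)  # keep a real stream for output
    sys.stdout = LoggingStream(logging.getLogger('autophone.worker'))
    print('captured')  # now routed through the logger instead of stdout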

I don't really have an opinion on a log per worker vs. message passing and one log total.  I would be concerned about messages that don't get received due to system failure, but again, I don't know the system.
Yup, all good ideas.  I would rank SMTP support lower than the others if our system is reasonably stable, i.e., only if it would really help debug errors.  I think the fact that AutoPhone is now physically maintainable reduces the need for this.

Looks like using proper logging in DeviceManager is close to finished (bug 793202).

And yeah the socketserver is interesting, but I don't think it really solves any particular problem right now (YAGNI).

We already redirect stdout in worker.py (top of PhoneWorkerSubProcess.run()).  We could probably change that to use the worker's logger instead without too much trouble.
I have several patches unrelated to logging which apply upon one another so I'll just attach them here instead of filing independent bugs for each. Hope you don't mind.
Attachment #731826 - Flags: review?(mcote)
catch operational errors when the sqlite jobs db is locked.
Attachment #731828 - Flags: review?(mcote)
When I updated the GUIDs in quitter, I stupidly limited the version to 19+. This changes it to 2+.
*binary patch - use git apply*
Attachment #731831 - Flags: review?(mcote)
Attachment #731833 - Flags: review?(mcote)
Attached patch logging (obsolete) — Splinter Review
Adds hierarchical log names, log decorators, and multiprocessing wrappers for the StreamHandler and TimedRotatingFileHandler classes. I'll attach examples in a bit.
Attachment #731835 - Flags: review?(mcote)
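
A stripped-down sketch of the naming plus decorator ideas, for flavor; the real patch's API may differ, and the names here are illustrative:

    import logging

    logger = logging.getLogger('autophone')            # main process
    worker = logging.getLogger('autophone.nexus-s-1')  # per-phone child

    class LogDecorator(object):
        """Prepend fixed metadata to every message."""
        def __init__(self, logger, extradict, extraformat):
            self._logger = logger
            self._extradict = extradict
            self._extraformat = extraformat

        def info(self, msg):
            extradict = dict(self._extradict, msg=msg)
            self._logger.info(self._extraformat % extradict)

    loggerdeco = LogDecorator(worker, {'phoneid': 'nexus-s-1'},
                              '%(phoneid)s|%(msg)s')
    loggerdeco.info('running fennec')  # -> 'nexus-s-1|running fennec'
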
Attached patch analyze logcat — Splinter Review
This removes the 90 second max time which caused more trouble than it was worth.
Attachment #731837 - Flags: review?(mcote)
Attached file logs.zip
Short examples of logs.
Comment on attachment 731826 [details] [diff] [review]
fix retryLimit typo in worker.py

Review of attachment 731826 [details] [diff] [review]:
-----------------------------------------------------------------

Cool thanks.

I wonder why we even have **kwargs as an argument to DeviceManagerSUT.__init__().  We don't use it, and it just hides errors like this (see the illustration below).  Going to ask someone about that.
Attachment #731826 - Flags: review?(mcote) → review+
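
An illustration of how **kwargs swallows this class of typo (names hypothetical):

    def connect(host, retrylimit=5, **kwargs):
        # A misspelled keyword lands silently in kwargs instead of raising
        # TypeError, so the default keeps being used.
        return retrylimit

    print(connect('192.168.1.2', retryLimit=10))  # prints 5; typo unnoticed
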
Comment on attachment 731828 [details] [diff] [review]
catch sqlite3.OperationalError's in jobs.py

Review of attachment 731828 [details] [diff] [review]:
-----------------------------------------------------------------

So this makes sense for fetching the next job and checking for pending jobs, since the worker will later retry, but I'm concerned about the calls that post a new job, clear the queue, and mark a job as completed.  I suppose it's not terrible to rerun jobs (the potential results of the latter two), although it would be a waste of time, but a failed call to post a new job will result in that job being skipped altogether.  I think these should be in a retry loop, and I'd even say they should be retried forever, probably with an email notification if it takes more than X attempts, since in that case something is really wrong (frozen process locking the db or something).
Attachment #731828 - Flags: review?(mcote) → review-
Comment on attachment 731831 [details] [diff] [review]
fix quitter's version limits

Review of attachment 731831 [details] [diff] [review]:
-----------------------------------------------------------------

Gonna take your word on this one. :)
Attachment #731831 - Flags: review?(mcote) → review+
Comment on attachment 731833 [details] [diff] [review]
make sure worker's locks are released

Review of attachment 731833 [details] [diff] [review]:
-----------------------------------------------------------------

This looks okay, but given that any exception in autophone.py should terminate the main process, is this really needed?
Attachment #731833 - Flags: review?(mcote) → review+
(In reply to Mark Côté ( :mcote ) from comment #14)

> This looks okay, but given that any exception in autophone.py should
> terminate the main process, is this really needed?

In my experience I was getting disconnected phones which would not recover and which would not respond to commands sent to autophone, even though autophone would say it had received them. Part of the fix may have been the job exceptions patch, but I don't think it was the full solution.

Reviewing the code, I can't justify it offhand and don't remember if I had a particular reason when I added it, other than eliminating possible causes of disconnected phones being unrecoverable. If I recall correctly, though, with this patch plus the job exceptions patch my disconnected phones would reconnect and I didn't have further situations where I had to restart autophone.py to recover a phone. I don't think the job exceptions patch alone was sufficient, but I won't swear to it.
PS. Once the logging patch lands, I can add some additional logging statements and see if/when I hit these finally clauses.
Comment on attachment 731835 [details] [diff] [review]
logging

Review of attachment 731835 [details] [diff] [review]:
-----------------------------------------------------------------

Okay now that I mostly understand what you're doing, this looks good. :) Some nits below.

::: autophone.py
@@ +31,4 @@
>  from mailer import Mailer
>  from worker import PhoneWorker
>  
> +from multiprocessinghandlers import MultiprocessingStreamHandler, MultiprocessingTimedRotatingFileHandler

Move this up with other local "from" imports.

@@ +153,1 @@
>                                   'values.' % phoneid)

Fix indentation.  Same with a few places below.

@@ +155,5 @@
>                                         worker.user_cfg)
>                      del self.restart_workers[phoneid]
>                      continue
>  
> +                self.console_logger.error('Worker %s died!' % phoneid)

We don't want to log this to autophone.log anymore?

@@ +431,5 @@
> +    logger.setLevel(loglevel)
> +
> +    filehandler = MultiprocessingTimedRotatingFileHandler(logfile,
> +                                                            when='midnight',
> +                                                            backupCount=7)

Fix indentation.

@@ +432,5 @@
> +
> +    filehandler = MultiprocessingTimedRotatingFileHandler(logfile,
> +                                                            when='midnight',
> +                                                            backupCount=7)
> +    #fileformatter = logging.Formatter('%(asctime)s|%(levelname)s|%(message)s')

Don't think this line is needed anymore at all.

@@ +438,5 @@
> +    #                    '|%(filename)s|%(funcName)s'
> +    #                    '|%(lineno)d|%(name)s'
> +    #                    '|%(process)d|%(processName)s'
> +    #                    '|%(thread)d|%(threadName)s'
> +    #                    '|%(message)s')

I'm not crazy about leaving commented-out code in there--it has a tendency to bitrot or be checked in accidentally. Maybe we can control this with an option?

::: logdecorator.py
@@ +13,5 @@
> +        if re.search('(%[(][a-zA-Z]+[)][^a-z]|%[(][a-zA-Z]+[)]$)', extraformat):
> +            raise ValueError('format string contains a %(attribute)'
> +                             'pattern without a type specifier.')
> +
> +

Only one blank line between functions.

@@ +21,5 @@
> +        extramsg = self._extraformat % extradict
> +        if '%(msg)s' in extramsg:
> +            import traceback
> +            import StringIO
> +            import sys

Why aren't these imported at the top of the file?

@@ +31,5 @@
> +            sys.stderr.write('msg: %s\n' % msg)
> +            sys.stderr.write('extradict: %s\n' % extradict)
> +            sys.stderr.write('extraformat: %s\n' % self._extraformat)
> +            sys.stderr.write('extramsg: %s\n' % extramsg)
> +            sys.stderr.write(80*'-' + '\n')

This should probably have (a) a comment in the code explaining why it's here and (b) a comment in the stderr output explaining why this is being output.

::: multiprocessinghandlers.py
@@ +10,5 @@
> +
> +    lock = multiprocessing.Lock()
> +
> +    def emit(self, record):
> +

Remove blank line (same below).

@@ +15,5 @@
> +        self.lock.acquire()
> +        try:
> +            logging.StreamHandler.emit(self, record)
> +        except Exception, e:
> +            pass

Why are exceptions being completely ignored here (and below)?

::: tests/runtestsremote.py
@@ +30,5 @@
>  class UnitTest(PhoneTest):
>  
>      def runjob(self, build_metadata, worker_subprocess):
> +        logger = self.logger
> +        loggerdeco = self.loggerdeco

We do this in every test now.  Can it be moved up to PhoneTest somehow?

@@ +45,5 @@
> +            self.logger = logger
> +            self.loggerdeco = loggerdeco
> +
> +    def setup_and_runtest(self, build_metadata, worker_subprocess):
> +

Remove blank line.

::: tests/s1s2test.py
@@ +80,5 @@
> +                                            'testname': testname},
> +                                           '%(phoneid)s|%(phoneip)s|%(buildid)s|'
> +                                           'Cache: %(cache_enabled)s|'
> +                                           '%(testname)s|%(msg)s')
> +            self.loggerdeco.info('Running test (%d/%d) for %s iterations' %

Just realized the %s should be a %d here.

@@ +89,5 @@
>                  attempt = 0
>                  while not success and attempt < 3:
>                      # Set status
>                      self.set_status(msg='Test %d/%d, run %s, attempt %s for url %s' %
> +                                    (testnum, len(self._urls.keys()), i, attempt,

Same with the first two %s's.

::: worker.py
@@ +515,5 @@
> +        #                    '|%(filename)s|%(funcName)s'
> +        #                    '|%(lineno)d|%(name)s'
> +        #                    '|%(process)d|%(processName)s'
> +        #                    '|%(thread)d|%(threadName)s'
> +        #                    '|%(message)s')

As with autophone.py, maybe we can control this with the same option?
Attachment #731835 - Flags: review?(mcote) → review+
Comment on attachment 731837 [details] [diff] [review]
analyze logcat

Review of attachment 731837 [details] [diff] [review]:
-----------------------------------------------------------------

::: tests/s1s2test.py
@@ +252,4 @@
>          else:
>              fennec_crashed = False
>          if throbstart and throbstop == 0 and not fennec_crashed:
> +            #throbstop = int(throbstart) + max_time * 1000

Just take out the commented-out line altogether.  We can always go back through history if we want to know what we used to do, but I can't see us bringing it back anytime soon.
Attachment #731837 - Flags: review?(mcote) → review+
(In reply to Mark Côté ( :mcote ) from comment #17)
> Comment on attachment 731835 [details] [diff] [review]
> logging
> 
> Review of attachment 731835 [details] [diff] [review]:

> 
> I'm not crazy about leaving commented-out code in there--it has a tendency
> to bitrot or be checked in accidentally. Maybe we can control this with an
> option?

It really isn't that useful. It was mostly educational for me to see which loggers were emitting statements. I mostly left it in for you in case you wanted to see the multiprocess behavior. I'd rather just remove it altogether really.

> 

> @@ +21,5 @@
> > +        extramsg = self._extraformat % extradict
> > +        if '%(msg)s' in extramsg:
> > +            import traceback
> > +            import StringIO
> > +            import sys
> 
> Why aren't these imported at the top of the file?

> 
> @@ +31,5 @@
> > +            sys.stderr.write('msg: %s\n' % msg)
> > +            sys.stderr.write('extradict: %s\n' % extradict)
> > +            sys.stderr.write('extraformat: %s\n' % self._extraformat)
> > +            sys.stderr.write('extramsg: %s\n' % extramsg)
> > +            sys.stderr.write(80*'-' + '\n')
> 
> This should probably have (a) a comment in the code explaining why it's here
> and (b) a comment in the stderr output explaining why this is being output.
> 

They were a debugging aid before I added the ValueError in __init__. I think I should just remove them altogether since, if I'm correct, the ValueError in __init__ would block this from ever happening.


> ::: multiprocessinghandlers.py
> @@ +10,5 @@
> > +
> > +    lock = multiprocessing.Lock()
> > +
> > +    def emit(self, record):
> > +
> 
> Remove blank line (same below).
> 
> @@ +15,5 @@
> > +        self.lock.acquire()
> > +        try:
> > +            logging.StreamHandler.emit(self, record)
> > +        except Exception, e:
> > +            pass
> 
> Why are exceptions being completely ignored here (and below)?
> 

http://docs.python.org/2/library/logging.html#logging.Handler.handleError
I was following the advice that logging exceptions be ignored but really I think I should remove the try: and let the logging system decide how to handle it as described in that link.


> ::: tests/runtestsremote.py
> @@ +30,5 @@
> >  class UnitTest(PhoneTest):
> >  
> >      def runjob(self, build_metadata, worker_subprocess):
> > +        logger = self.logger
> > +        loggerdeco = self.loggerdeco
> 
> We do this in every test now.  Can it be moved up to PhoneTest somehow?

Let me think about it a bit. I need to be careful to reset the values after they have changed. I may not need the logger part though.

> 

> ::: worker.py
> @@ +515,5 @@
> > +        #                    '|%(filename)s|%(funcName)s'
> > +        #                    '|%(lineno)d|%(name)s'
> > +        #                    '|%(process)d|%(processName)s'
> > +        #                    '|%(thread)d|%(threadName)s'
> > +        #                    '|%(message)s')
> 
> As with autophone.py, maybe we can control this with the same option?

Just remove it altogether, as above?
(In reply to Bob Clary [:bc:] from comment #19)
> (In reply to Mark Côté ( :mcote ) from comment #17)
> > Comment on attachment 731835 [details] [diff] [review]
> > logging
> > 
> > Review of attachment 731835 [details] [diff] [review]:
> 
> > 
> > I'm not crazy about leaving commented-out code in there--it has a tendency
> > to bitrot or be checked in accidentally. Maybe we can control this with an
> > option?
> 
> It really isn't that useful. It was mostly educational for me to see which
> loggers were emitting statements. I mostly left it in for you in case you
> wanted to see the multiprocess behavior. I'd rather just remove it
> altogether really.

Fine by me.

> > ::: multiprocessinghandlers.py
> > @@ +10,5 @@
> > > +
> > > +    lock = multiprocessing.Lock()
> > > +
> > > +    def emit(self, record):
> > > +
> > 
> > Remove blank line (same below).
> > 
> > @@ +15,5 @@
> > > +        self.lock.acquire()
> > > +        try:
> > > +            logging.StreamHandler.emit(self, record)
> > > +        except Exception, e:
> > > +            pass
> > 
> > Why are exceptions being completely ignored here (and below)?
> > 
> 
> http://docs.python.org/2/library/logging.html#logging.Handler.handleError
> I was following the advice that logging exceptions be ignored but really I
> think I should remove the try: and let the logging system decide how to
> handle it as described in that link.

Ok.

> > ::: worker.py
> > @@ +515,5 @@
> > > +        #                    '|%(filename)s|%(funcName)s'
> > > +        #                    '|%(lineno)d|%(name)s'
> > > +        #                    '|%(process)d|%(processName)s'
> > > +        #                    '|%(thread)d|%(threadName)s'
> > > +        #                    '|%(message)s')
> > 
> > As with autophone.py, maybe we can control this with the same option?
> 
> just remove it altogether as in above?

Yeah I think so.
I moved this to apply after the logging patch so we could use the new logging here.
Attachment #731828 - Attachment is obsolete: true
Attachment #733903 - Flags: review?(mcote)
Attached patch logging, v2Splinter Review
FYI, hopefully this covers all of your review comments.  Feel free to do a new review if you want. ;-)
Attachment #731835 - Attachment is obsolete: true
Attachment #733905 - Flags: feedback?(mcote)
(In reply to Bob Clary [:bc:] from comment #22)
> Created attachment 733905 [details] [diff] [review]
> logging

There is a problem with this patch. When the day rolls over, the worker processes continue to write to the original autophone.log file, which has been renamed to autophone.log.YYYY-MM-DD (their open file descriptors still point at the renamed file). Their private logs roll over correctly. Trying to figure this out.
Attached patch fix hang in worker main_loop (obsolete) — Splinter Review
This patch contains mostly a bunch of logging statements. The meat is in main_loop, where we recover a phone if it has_error() and sleep for a bit if all commands and jobs are processed. Without the sleep we get into a busy loop that starves the CPU and prevents us from getting anything from the multiprocessing queue. The "main_loop: Commands and jobs processed" log message can be pretty verbose in the log, so we might want to take that out, but it's debug-only so it really shouldn't be a problem in production runs.
Attachment #734670 - Flags: review?(mcote)
When failing to get the symbols file, the IOError exception ioerror does not always contain the nested ioerror.strerror.strerror attribute. I'm not really sure what's up with the ftp error handling, but this patch converts the exception to a string and checks for the 550 in that.
Attachment #734677 - Flags: review?(mcote)
Comment on attachment 733903 [details] [diff] [review]
catch sqlite3.OperationalErrors in jobs.py

Review of attachment 733903 [details] [diff] [review]:
-----------------------------------------------------------------

A few comments below.  The r- is specifically for the code duplication, which will be a pain to maintain.

::: jobs.py
@@ +44,5 @@
> +                                'connection errors.')
> +                logger.exception('Attempt %d failed to connect to jobs '
> +                                 'database. Waiting for %d seconds.' %
> +                                 (attempt, self.SQL_RETRY_DELAY))
> +                time.sleep(self.SQL_RETRY_DELAY)

A few comments about these blocks:

Waiting a full minute seems a bit excessive, no?

I'm thinking an exception should be raised if it can't get a connection after a certain number of retries.  You'd probably have to shut down everything anyway if one or more processes got this stuck.

This code is repeated several times.  I'm sure you could refactor it, possibly into a decorator.

Finally, why have this wrapper in both _conn() *and* the functions that call it?
Attachment #733903 - Flags: review?(mcote) → review-
Comment on attachment 733905 [details] [diff] [review]
logging, v2

Just renaming this so I can try an interdiff.
Attachment #733905 - Attachment description: logging → logging, v2
(In reply to Mark Côté ( :mcote ) from comment #28)
> Comment on attachment 733903 [details] [diff] [review]
> catch sqlite3.OperationalErrors in jobs.py
> 
> Review of attachment 733903 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> A few comments below.  The r- is specifically for the code duplication,
> which will be a pain to maintain.
> 
> ::: jobs.py
> @@ +44,5 @@
> > +                                'connection errors.')
> > +                logger.exception('Attempt %d failed to connect to jobs '
> > +                                 'database. Waiting for %d seconds.' %
> > +                                 (attempt, self.SQL_RETRY_DELAY))
> > +                time.sleep(self.SQL_RETRY_DELAY)
> 
> A few comments about these blocks:
> 
> Waiting a full minute seems a bit excessive, no?
> 

Not really, I think. With 3G network emulation it takes several minutes to install a build. I'm not entirely clear on how and where the sqlite db gets locked. One minute seemed safe: not too short, not too long...

> I'm thinking an exception should be raised if it can't get a connection
> after a certain number of retries.  You'd probably have to shut down
> everything anyway if one or more processes got this stuck.


(In reply to Mark Côté ( :mcote ) from comment #12)
> Comment on attachment 731828 [details] [diff] [review]
> catch sqlite3.OperationalError's in jobs.py
...
> will result in that job being skipped altogether.  I think these should be
> in a retry loop, and I'd even say they should be retried forever, probably
> with an email notification if it takes more than X attempts, since in that
> case something is really wrong (frozen process locking the db or something).

> 
> This code is repeated several times.  I'm sure you could refactor it,
> possibly into a decorator.
> 

Not sure what you mean about a decorator.

> Finally, why have this wrapper in both _conn() *and* the functions that call
> it?

Because I wanted to distinguish the different cases.
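
For reference, the kind of decorator Mark is suggesting might look like this; a hypothetical sketch, with the Jobs method invented for illustration:

    import functools
    import sqlite3
    import time

    def retry_on_locked(retries=5, delay=60):
        """Retry a jobs-db call while sqlite raises OperationalError
        (e.g. 'database is locked'); re-raise after the final attempt."""
        def wrap(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(1, retries + 1):
                    try:
                        return func(*args, **kwargs)
                    except sqlite3.OperationalError:
                        if attempt == retries:
                            raise
                        time.sleep(delay)
            return wrapper
        return wrap

    class Jobs(object):
        @retry_on_locked()
        def get_next_job(self):
            pass  # the single copy of the sqlite access lives here
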
Comment on attachment 733905 [details] [diff] [review]
logging, v2

Review of attachment 733905 [details] [diff] [review]:
-----------------------------------------------------------------

::: autophone.py
@@ +22,4 @@
>  from manifestparser import TestManifest
>  from mozdevice.devicemanager import NetworkTools
>  from pulsebuildmonitor import start_pulse_monitor
> +from multiprocessinghandlers import MultiprocessingStreamHandler, MultiprocessingTimedRotatingFileHandler

Heh I didn't mean up this far--I meant with the other "from" imports of files in the same directory, e.g. mailer and worker.
Attachment #733905 - Flags: feedback?(mcote) → feedback+
Comment on attachment 734670 [details] [diff] [review]
fix hang in worker main_loop

Review of attachment 734670 [details] [diff] [review]:
-----------------------------------------------------------------

r+ing only as a temporary workaround, as discussed below.

::: worker.py
@@ +483,4 @@
>              self.ping()
> +        else:
> +            try:
> +                self.loggerdeco.debug('handle_cmd: Unknown request' % request[0])

Missing a %s here.

@@ +484,5 @@
> +        else:
> +            try:
> +                self.loggerdeco.debug('handle_cmd: Unknown request' % request[0])
> +            except:
> +                self.loggerdeco.exception('handle_cmd: Unknown request')

Why the blanket except here?

@@ +519,5 @@
>              if job:
>                  self.handle_job(job)
> +            else:
> +                self.loggerdeco.debug('main_loop: Commands and jobs processed')
> +                time.sleep(10)

As discussed, I don't think this should be necessary; I structured the loop such that it should block only on self.cmd_queue.get(...).  But if you want to add it in just to temporarily fix your system, I can try to figure it out next week.
Attachment #734670 - Flags: review?(mcote) → review+
Comment on attachment 734671 [details] [diff] [review]
fix zipfile.BadZipfile typo in builds.py

Review of attachment 734671 [details] [diff] [review]:
-----------------------------------------------------------------

Nice catch.  Stupid dynamic language.
Attachment #734671 - Flags: review?(mcote) → review+
Comment on attachment 734673 [details] [diff] [review]
fix missing DMError import in s1s2test.py

Review of attachment 734673 [details] [diff] [review]:
-----------------------------------------------------------------

Another good catch.  Not sure why pyflakes isn't showing it to me.
Attachment #734673 - Flags: review?(mcote) → review+
Comment on attachment 734677 [details] [diff] [review]
fix ftp 550 Failed to change directory getting symbols

Review of attachment 734677 [details] [diff] [review]:
-----------------------------------------------------------------

::: builds.py
@@ +280,4 @@
>                  symbols_zipfile.extractall(symbols_path)
>                  symbols_zipfile.close()
>              except IOError, ioerror:
> +                if '550 Failed to change directory' in ioerror.__str__():

I *think* str(ioerror) is the preferred way to do this, as it expressly tries to get a string out of the target object. __str__() is more of an implementation detail, one way to create this string.  If this returns the same thing here, that is.
Attachment #734677 - Flags: review?(mcote) → review+
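
As a sketch of the reviewed check, with fetch_symbols() a hypothetical stand-in for the download step:

    def fetch_symbols():
        raise IOError('550 Failed to change directory')  # simulated ftp error

    try:
        fetch_symbols()
    except IOError as ioerror:
        # str() is the supported way to stringify the exception;
        # __str__() is an implementation detail.
        if '550 Failed to change directory' in str(ioerror):
            pass  # no symbols published for this build
        else:
            raise
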
Attachment #733903 - Attachment is obsolete: true
I changed the loops to a single loop and it seems much more readable to me.
Attachment #734670 - Attachment is obsolete: true
Attachment #735200 - Flags: review?(mcote)
(In reply to Mark Côté ( :mcote ) from comment #36)

> I *think* str(ioerror) is the preferred way to do this, as it expressly
> tries to get a string out of the target object

Right. I'll update locally.
(In reply to Bob Clary [:bc:] from comment #31)
> Created attachment 734847 [details] [diff] [review]
> logging follow up patch to fix rollovers

fwiw, this worked fine overnight with 8 workers. I like it. :-)
Comment on attachment 734847 [details] [diff] [review]
logging follow up patch to fix rollovers

Not quite ready for prime time. I still have a concurrency issue around rollover.
Attachment #734847 - Flags: review?(mcote)
If the log file is deleted by another process during rollover, wait until it is recreated before returning True to indicate the log should be reopened.

I thought of using filelock and possibly other approaches, but I think this is cleanest. I have tested it somewhat but need to do some heavier testing with quicker rollovers to be absolutely sure this is it. You'll probably want me to take out the comments about Why? and Wait?
Attachment #736805 - Flags: review?(mcote)
Comment on attachment 735200 [details] [diff] [review]
fix hang in worker main_loop, v2

Review of attachment 735200 [details] [diff] [review]:
-----------------------------------------------------------------

I like this; in particular, the removal of the complicated condition at the top of the old loop improves readability.  I haven't tested it, but I mentally went through the loop a few times and it looks fine.  Just a couple little tweaks.

::: worker.py
@@ +501,5 @@
> +            except Queue.Empty:
> +                request = None
> +                if self.has_error():
> +                    self.recover_phone()
> +                else:

I think this should be "if not self.has_error()" rather than the "else" in case self.recover_phone() succeeded.

@@ +504,5 @@
> +                    self.recover_phone()
> +                else:
> +                    job = self.jobs.get_next_job()
> +                    if job:
> +                        self.handle_job(job)

I think in this case, we should use "continue" or put the next bit in an "else".  Otherwise, if there is another job queued up but no commands, we will wait for the timeout for no reason.
Attachment #735200 - Flags: review?(mcote) → review+
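
A hypothetical sketch of the loop shape those two tweaks imply; the method and attribute names are taken from the quoted hunks, and the 60-second timeout is an assumption, so this is not the patch itself:

    import Queue  # Python 2 stdlib name

    def main_loop(self):
        request = None
        while True:
            if request:
                self.handle_cmd(request)
                request = None
                continue
            if self.has_error():
                self.recover_phone()
            if not self.has_error():  # recheck rather than 'else': recovery may succeed
                job = self.jobs.get_next_job()
                if job:
                    self.handle_job(job)
                    continue  # look for more jobs before blocking again
            try:
                request = self.cmd_queue.get(timeout=60)
            except Queue.Empty:
                pass
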
Comment on attachment 736805 [details] [diff] [review]
logging follow up patch to fix rollovers, v2

Review of attachment 736805 [details] [diff] [review]:
-----------------------------------------------------------------

This seems almost right, with the change from filesize to device/inode comparison, but I am concerned about the unexpected OSError.  We need to figure that out.

::: multiprocessinghandlers.py
@@ +55,5 @@
> +        """Check if the log file has already been rotated by another
> +        process by comparing the file size of the logfile name to the
> +        file size of the file pointed to by the stream attribute.  If
> +        they differ, the log file named by baseFilename has been
> +        rotated, and the stream now points to the previous version."""

As discussed, the right way to do this is stat(self.baseFilename), fstat(self.stream.fileno()), and seeing if st_ino and st_dev are equal.  If so, self.stream is pointing to self.baseFilename.

@@ +64,5 @@
> +            current_size = self.stream.tell()
> +            if current_size != baseFilename_size:
> +                return True
> +            return False
> +        except OSError, oserror:

Narrow the scope of this try/except so we know exactly where the OSError is coming from (presumably when trying to access self.baseFilename, but let's be sure).

@@ +65,5 @@
> +            if current_size != baseFilename_size:
> +                return True
> +            return False
> +        except OSError, oserror:
> +            if oserror.errno != 2:

Use the "errno" module instead of raw numbers, since they are platform dependent.  In this case it would be errno.ENOENT.

@@ +69,5 @@
> +            if oserror.errno != 2:
> +                raise
> +            # The logfile has been deleted during a rollover by
> +            # another process. Why didn't the multiprocessing.lock
> +            # save me? Wait for the logfile to be recreated?

We need to figure out why this is happening, rather than potentially masking a real problem that may present itself elsewhere later.
Attachment #736805 - Flags: review?(mcote) → review-
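
Pulling Mark's three points together, the check might look like this (a sketch, not the landed code):

    import errno
    import os

    def logfile_rotated(self):
        """Has self.baseFilename been rotated out from under
        self.stream by another process?"""
        try:
            st_base = os.stat(self.baseFilename)
        except OSError as oserror:
            if oserror.errno != errno.ENOENT:
                raise
            return True  # the file was renamed/deleted: reopen it
        st_stream = os.fstat(self.stream.fileno())
        # Same device and inode means the stream still points at baseFilename.
        return (st_base.st_dev != st_stream.st_dev or
                st_base.st_ino != st_stream.st_ino)
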
If you are having trouble replicating the missing-file problem, then you can leave that loop in, but it would be nice to log the problem (with another emit call or something).  I would like to figure it out.
This has been running locally since 4-13 but I haven't seen the rollover issue again. :-(
Attachment #736805 - Attachment is obsolete: true
Attachment #738136 - Flags: review?(mcote)
Comment on attachment 738136 [details] [diff] [review]
logging follow up patch to fix rollovers, v3

Review of attachment 738136 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great!
Attachment #738136 - Flags: review?(mcote) → review+
Blocks: 872857
Blocks: 873081
Product: Testing → Testing Graveyard