Closed Bug 853961 Opened 11 years ago Closed 11 years ago

Revamp Autophone logging

Categories

Testing Graveyard :: Autophone (defect)
Priority: Not set
Severity: normal

Tracking

(Not tracked)
Status: RESOLVED FIXED
People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(13 files, 5 obsolete files)

500 bytes, patch (mcote: review+)
1.36 KB, patch (mcote: review+)
2.10 KB, patch (mcote: review+)
532 bytes, patch (mcote: review+)
36.15 KB, application/zip
65.25 KB, patch (mcote: feedback+)
1.04 KB, patch (mcote: review+)
332 bytes, patch (mcote: review+)
629 bytes, patch (mcote: review+)
2.87 KB, patch
9.97 KB, patch
3.94 KB, patch (mcote: review+)
3.97 KB, patch (mcote: review+)
I've been looking into the logging in Autophone. Some of the ideas I've been considering:

* At a minimum we should use TimedRotatingFileHandlers to rotate our logs every day, to keep them from growing without bound (see the first sketch after this list).

* By passing exc_info=1 to logging calls, or by using logging.exception, we can automatically include exception information in the message and clean up some of our error-handling code.

* We can use StreamHandlers to customize which log messages are sent to the console and eliminate almost all uses of print. For example, we could say that only ERROR log messages are printed to the console while all messages are written to the file log.

* It is possible to use SMTPHandlers to send specific log messages via email if we override the emit method to use smtplib.SMTP_SSL rather than smtplib.SMTP with starttls(), which doesn't work with mail.mozilla.com (see the second sketch below).

* In individual tests, we often log messages that do not contain any additional metadata about what the test is doing. For example, in s1s2 we might say 'running fennec' but not say which build is being tested, whether caching is enabled, etc. It should be possible to use special formatters for our tests so that we can easily include all of the relevant metadata.

* We sometimes use print statements (not often) and also call into mozbase.mozdevice.deviceManagerSUT, which does use print. As a result we capture stdout for the workers and report it to .out text files. I really hate this, partly due to the additional files but also because of the loss of any timing/sequence information between the messages in the log file and the .out file. I suppose we could capture stdout from dm calls and log them with the other worker log messages, but I'm open to suggestions.

* Currently the workers post status messages back to the Autophone process using a multiprocessing.Queue; these are then logged into autophone's main log file. It would be possible, I think, to use a log server in the autophone process and a SocketHandler in the worker processes so that log messages are routed back to the Autophone process and reported there. Right now each worker logs its results to a separate log file. Losing that granularity is an issue, but it might be possible to use a different handler in Autophone to report worker log messages to a separate file. *This is the most problematic idea and the one that I can't justify.*
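
A minimal sketch of the first three ideas; the 'autophone' logger name, the log file name, and the ERROR-only console threshold are illustrative, not a final configuration:

    import logging
    import logging.handlers

    logger = logging.getLogger('autophone')
    logger.setLevel(logging.DEBUG)

    # Rotate the file log at midnight, keeping a week of backups.
    filehandler = logging.handlers.TimedRotatingFileHandler(
        'autophone.log', when='midnight', backupCount=7)
    filehandler.setFormatter(
        logging.Formatter('%(asctime)s|%(levelname)s|%(message)s'))
    logger.addHandler(filehandler)

    # Only ERROR and above reach the console; everything reaches the file.
    consolehandler = logging.StreamHandler()
    consolehandler.setLevel(logging.ERROR)
    logger.addHandler(consolehandler)

    try:
        raise RuntimeError('simulated failure')
    except RuntimeError:
        # exception() is equivalent to error(..., exc_info=1): the
        # traceback is appended to the message automatically.
        logger.exception('running fennec failed')

And a sketch of the SMTP_SSL idea; SSLSMTPHandler is a hypothetical name, and the message assembly is simplified relative to the stdlib's emit():

    import logging.handlers
    import smtplib

    class SSLSMTPHandler(logging.handlers.SMTPHandler):
        def emit(self, record):
            try:
                # Use an SSL connection instead of SMTP + starttls().
                port = self.mailport or smtplib.SMTP_SSL_PORT
                smtp = smtplib.SMTP_SSL(self.mailhost, port)
                body = ('From: %s\r\nTo: %s\r\nSubject: %s\r\n\r\n%s' %
                        (self.fromaddr, ','.join(self.toaddrs),
                         self.getSubject(record), self.format(record)))
                if self.username:
                    smtp.login(self.username, self.password)
                smtp.sendmail(self.fromaddr, self.toaddrs, body)
                smtp.quit()
            except Exception:
                self.handleError(record)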

Mark, Jeff: what do you think?
I am not super-knowledgeable about the problem space, but what you outlined above seems well thought out and a great place to start.  I would like to fold lessons learned back into mozlog, but it is a bit of a stub class so I wouldn't want to impede progress with it either.

WRT logging the print statements of device manager...
* Personally, when using it as an API, I don't generally think DM should print anything, at least as a blanket solution.  Maybe you pass it callbacks for whatever levels make sense internal to DM.  That said, I don't know the code as well as others, nor do I use it as much as others, so it's at best a back-seat opinion.
* It is possible to capture stdout/stderr for logging purposes.  In fact, I would say that in reality you do want to capture stdout/stderr for logging purposes.  Doing this is easy (monkey-patch e.g. sys.stdout; see the sketch below), but making sure everything continues to work is hard: cleanup, what if someone else is patching them, etc.  I've found this an annoying problem that I've "solved" with continuous plugging of leaks.  That said, I'd love to see a good solid solution to the problem.
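
One common shape for that monkey patch, as a sketch (the logger name is hypothetical, and the cleanup/double-patching caveats above still apply):

    import logging
    import sys

    class LoggingStream(object):
        """File-like object that forwards writes to a logger."""
        def __init__(self, logger, level=logging.INFO):
            self.logger = logger
            self.level = level

        def write(self, buf):
            for line in buf.rstrip().splitlines():
                self.logger.log(self.level, line.rstrip())

        def flush(self):
            pass  # the logging handlers do their own flushing

    logging.basicConfig(stream=sys.stderr)  # keep a real stream for output
    sys.stdout = LoggingStream(logging.getLogger('autophone.worker'))
    print('captured')  # now routed through the logger instead of stdout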

I don't really have an opinion on a log per worker vs. message passing and one log total.  I would be concerned about messages that don't get received due to system failure, but again, I don't know the system.
Yup, all good ideas.  I would rank SMTP support lower than the others if our system is reasonably stable, i.e., only if it would really help debug errors.  I think the fact that AutoPhone is now physically maintainable reduces the need for this.

Looks like using proper logging in DeviceManager is close to finished (bug 793202).

And yeah the socketserver is interesting, but I don't think it really solves any particular problem right now (YAGNI).

We already redirect stdout in worker.py (top of PhoneWorkerSubProcess.run()).  We could probably change that to use the worker's logger instead without too much trouble.
I have several patches unrelated to logging which apply upon one another so I'll just attach them here instead of filing independent bugs for each. Hope you don't mind.
Attachment #731826 - Flags: review?(mcote)
catch operational errors when the sqlite jobs db is locked.
Attachment #731828 - Flags: review?(mcote)
When I updated the GUIDs in quitter, I stupidly limited the version to 19+. This changes it to 2+.
*binary patch - use git apply*
Attachment #731831 - Flags: review?(mcote)
Attachment #731833 - Flags: review?(mcote)
Attached patch logging (obsolete) — Splinter Review
Adds hierarchical log names, log decorators, and multiprocessing wrappers for the StreamHandler and TimedRotatingFileHandler classes. I'll attach examples in a bit.
Attachment #731835 - Flags: review?(mcote)
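
A stripped-down sketch of the naming plus decorator ideas, for flavor; the real patch's API may differ, and the names here are illustrative:

    import logging

    logger = logging.getLogger('autophone')            # main process
    worker = logging.getLogger('autophone.nexus-s-1')  # per-phone child

    class LogDecorator(object):
        """Prepend fixed metadata to every message."""
        def __init__(self, logger, extradict, extraformat):
            self._logger = logger
            self._extradict = extradict
            self._extraformat = extraformat

        def info(self, msg):
            extradict = dict(self._extradict, msg=msg)
            self._logger.info(self._extraformat % extradict)

    loggerdeco = LogDecorator(worker, {'phoneid': 'nexus-s-1'},
                              '%(phoneid)s|%(msg)s')
    loggerdeco.info('running fennec')  # -> 'nexus-s-1|running fennec'
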
Attached patch analyze logcat — Splinter Review
This removes the 90 second max time which caused more trouble than it was worth.
Attachment #731837 - Flags: review?(mcote)
Attached file logs.zip
Short examples of logs.
Comment on attachment 731826 [details] [diff] [review]
fix retryLimit typo in worker.py

Review of attachment 731826 [details] [diff] [review]:
-----------------------------------------------------------------

Cool thanks.

I wonder why we even have **kwargs as an argument to DeviceManagerSUT.__init__().  We don't use it, and it just hides errors like this (see the illustration below).  Going to ask someone about that.
Attachment #731826 - Flags: review?(mcote) → review+
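
An illustration of how **kwargs swallows this class of typo (names hypothetical):

    def connect(host, retrylimit=5, **kwargs):
        # A misspelled keyword lands silently in kwargs instead of raising
        # TypeError, so the default keeps being used.
        return retrylimit

    print(connect('192.168.1.2', retryLimit=10))  # prints 5; typo unnoticed
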
Comment on attachment 731828 [details] [diff] [review]
catch sqlite3.OperationalError's in jobs.py

Review of attachment 731828 [details] [diff] [review]:
-----------------------------------------------------------------

So this makes sense for fetching the next job and checking for pending jobs, since the worker will later retry, but I'm concerned about the calls that post a new job, clear the queue, and mark a job as completed.  I suppose it's not terrible to rerun jobs (the potential results of the latter two), although it would be a waste of time, but a failed call to post a new job will result in that job being skipped altogether.  I think these should be in a retry loop, and I'd even say they should be retried forever, probably with an email notification if it takes more than X attempts, since in that case something is really wrong (frozen process locking the db or something).
Attachment #731828 - Flags: review?(mcote) → review-
Comment on attachment 731831 [details] [diff] [review]
fix quitter's version limits

Review of attachment 731831 [details] [diff] [review]:
-----------------------------------------------------------------

Gonna take your word on this one. :)
Attachment #731831 - Flags: review?(mcote) → review+
Comment on attachment 731833 [details] [diff] [review]
make sure worker's locks are released

Review of attachment 731833 [details] [diff] [review]:
-----------------------------------------------------------------

This looks okay, but given that any exception in autophone.py should terminate the main process, is this really needed?
Attachment #731833 - Flags: review?(mcote) → review+
(In reply to Mark Côté ( :mcote ) from comment #14)

> This looks okay, but given that any exception in autophone.py should
> terminate the main process, is this really needed?

In my experience I was getting disconnected phones which would not recover and which would not respond to commands sent to autophone, even though autophone would say it had received them. Part of the fix may have been the job exceptions patch, but I don't think it was the full solution.

Reviewing the code, I can't justify it offhand and don't remember if I had a particular reason when I added it, other than eliminating possible causes of disconnected phones being unrecoverable. If I recall correctly, though, with this patch plus the job exceptions patch my disconnected phones would reconnect and I didn't have further situations where I had to restart autophone.py to recover a phone. I don't think the job exceptions patch alone was sufficient, but I won't swear to it.
PS. Once the logging patch lands, I can add some additional logging statements and see if/when I hit these finally clauses.
Comment on attachment 731835 [details] [diff] [review]
logging

Review of attachment 731835 [details] [diff] [review]:
-----------------------------------------------------------------

Okay now that I mostly understand what you're doing, this looks good. :) Some nits below.

::: autophone.py
@@ +31,4 @@
>  from mailer import Mailer
>  from worker import PhoneWorker
>  
> +from multiprocessinghandlers import MultiprocessingStreamHandler, MultiprocessingTimedRotatingFileHandler

Move this up with other local "from" imports.

@@ +153,1 @@
>                                   'values.' % phoneid)

Fix indentation.  Same with a few places below.

@@ +155,5 @@
>                                         worker.user_cfg)
>                      del self.restart_workers[phoneid]
>                      continue
>  
> +                self.console_logger.error('Worker %s died!' % phoneid)

We don't want to log this to autophone.log anymore?

@@ +431,5 @@
> +    logger.setLevel(loglevel)
> +
> +    filehandler = MultiprocessingTimedRotatingFileHandler(logfile,
> +                                                            when='midnight',
> +                                                            backupCount=7)

Fix indentation.

@@ +432,5 @@
> +
> +    filehandler = MultiprocessingTimedRotatingFileHandler(logfile,
> +                                                            when='midnight',
> +                                                            backupCount=7)
> +    #fileformatter = logging.Formatter('%(asctime)s|%(levelname)s|%(message)s')

Don't think this line is needed anymore at all.

@@ +438,5 @@
> +    #                    '|%(filename)s|%(funcName)s'
> +    #                    '|%(lineno)d|%(name)s'
> +    #                    '|%(process)d|%(processName)s'
> +    #                    '|%(thread)d|%(threadName)s'
> +    #                    '|%(message)s')

I'm not crazy about leaving commented-out code in there--it has a tendency to bitrot or be checked in accidentally. Maybe we can control this with an option?

::: logdecorator.py
@@ +13,5 @@
> +        if re.search('(%[(][a-zA-Z]+[)][^a-z]|%[(][a-zA-Z]+[)]$)', extraformat):
> +            raise ValueError('format string contains a %(attribute)'
> +                             'pattern without a type specifier.')
> +
> +

Only one blank line between functions.

@@ +21,5 @@
> +        extramsg = self._extraformat % extradict
> +        if '%(msg)s' in extramsg:
> +            import traceback
> +            import StringIO
> +            import sys

Why aren't these imported at the top of the file?

@@ +31,5 @@
> +            sys.stderr.write('msg: %s\n' % msg)
> +            sys.stderr.write('extradict: %s\n' % extradict)
> +            sys.stderr.write('extraformat: %s\n' % self._extraformat)
> +            sys.stderr.write('extramsg: %s\n' % extramsg)
> +            sys.stderr.write(80*'-' + '\n')

This should probably have (a) a comment in the code explaining why it's here and (b) a comment in the stderr output explaining why this is being output.

::: multiprocessinghandlers.py
@@ +10,5 @@
> +
> +    lock = multiprocessing.Lock()
> +
> +    def emit(self, record):
> +

Remove blank line (same below).

@@ +15,5 @@
> +        self.lock.acquire()
> +        try:
> +            logging.StreamHandler.emit(self, record)
> +        except Exception, e:
> +            pass

Why are exceptions being completely ignored here (and below)?

::: tests/runtestsremote.py
@@ +30,5 @@
>  class UnitTest(PhoneTest):
>  
>      def runjob(self, build_metadata, worker_subprocess):
> +        logger = self.logger
> +        loggerdeco = self.loggerdeco

We do this in every test now.  Can it be moved up to PhoneTest somehow?

@@ +45,5 @@
> +            self.logger = logger
> +            self.loggerdeco = loggerdeco
> +
> +    def setup_and_runtest(self, build_metadata, worker_subprocess):
> +

Remove blank line.

::: tests/s1s2test.py
@@ +80,5 @@
> +                                            'testname': testname},
> +                                           '%(phoneid)s|%(phoneip)s|%(buildid)s|'
> +                                           'Cache: %(cache_enabled)s|'
> +                                           '%(testname)s|%(msg)s')
> +            self.loggerdeco.info('Running test (%d/%d) for %s iterations' %

Just realized the %s should be a %d here.

@@ +89,5 @@
>                  attempt = 0
>                  while not success and attempt < 3:
>                      # Set status
>                      self.set_status(msg='Test %d/%d, run %s, attempt %s for url %s' %
> +                                    (testnum, len(self._urls.keys()), i, attempt,

Same with the first two %s's.

::: worker.py
@@ +515,5 @@
> +        #                    '|%(filename)s|%(funcName)s'
> +        #                    '|%(lineno)d|%(name)s'
> +        #                    '|%(process)d|%(processName)s'
> +        #                    '|%(thread)d|%(threadName)s'
> +        #                    '|%(message)s')

As with autophone.py, maybe we can control this with the same option?
Attachment #731835 - Flags: review?(mcote) → review+
Comment on attachment 731837 [details] [diff] [review]
analyze logcat

Review of attachment 731837 [details] [diff] [review]:
-----------------------------------------------------------------

::: tests/s1s2test.py
@@ +252,4 @@
>          else:
>              fennec_crashed = False
>          if throbstart and throbstop == 0 and not fennec_crashed:
> +            #throbstop = int(throbstart) + max_time * 1000

Just take out the commented-out line altogether.  We can always go back through history if we want to know what we used to do, but I can't see us bringing it back anytime soon.
Attachment #731837 - Flags: review?(mcote) → review+
(In reply to Mark Côté ( :mcote ) from comment #17)
> Comment on attachment 731835 [details] [diff] [review]
> logging
> 
> Review of attachment 731835 [details] [diff] [review]:

> 
> I'm not crazy about leaving commented-out code in there--it has a tendency
> to bitrot or be checked in accidentally. Maybe we can control this with an
> option?

It really isn't that useful. It was mostly educational for me to see which loggers were emitting statements. I mostly left it in for you in case you wanted to see the multiprocess behavior. I'd rather just remove it altogether really.

> 

> @@ +21,5 @@
> > +        extramsg = self._extraformat % extradict
> > +        if '%(msg)s' in extramsg:
> > +            import traceback
> > +            import StringIO
> > +            import sys
> 
> Why aren't these imported at the top of the file?

> 
> @@ +31,5 @@
> > +            sys.stderr.write('msg: %s\n' % msg)
> > +            sys.stderr.write('extradict: %s\n' % extradict)
> > +            sys.stderr.write('extraformat: %s\n' % self._extraformat)
> > +            sys.stderr.write('extramsg: %s\n' % extramsg)
> > +            sys.stderr.write(80*'-' + '\n')
> 
> This should probably have (a) a comment in the code explaining why it's here
> and (b) a comment in the stderr output explaining why this is being output.
> 

They were a debugging aid before I added the ValueError in __init__. I think I should just remove them altogether since, if I'm correct, the ValueError in __init__ would block this from ever happening.


> ::: multiprocessinghandlers.py
> @@ +10,5 @@
> > +
> > +    lock = multiprocessing.Lock()
> > +
> > +    def emit(self, record):
> > +
> 
> Remove blank line (same below).
> 
> @@ +15,5 @@
> > +        self.lock.acquire()
> > +        try:
> > +            logging.StreamHandler.emit(self, record)
> > +        except Exception, e:
> > +            pass
> 
> Why are exceptions being completely ignored here (and below)?
> 

http://docs.python.org/2/library/logging.html#logging.Handler.handleError
I was following the advice that logging exceptions be ignored but really I think I should remove the try: and let the logging system decide how to handle it as described in that link.


> ::: tests/runtestsremote.py
> @@ +30,5 @@
> >  class UnitTest(PhoneTest):
> >  
> >      def runjob(self, build_metadata, worker_subprocess):
> > +        logger = self.logger
> > +        loggerdeco = self.loggerdeco
> 
> We do this in every test now.  Can it be moved up to PhoneTest somehow?

Let me think about it a bit. I need to be careful to reset the values after they have changed. I may not need the logger part though.

> 

> ::: worker.py
> @@ +515,5 @@
> > +        #                    '|%(filename)s|%(funcName)s'
> > +        #                    '|%(lineno)d|%(name)s'
> > +        #                    '|%(process)d|%(processName)s'
> > +        #                    '|%(thread)d|%(threadName)s'
> > +        #                    '|%(message)s')
> 
> As with autophone.py, maybe we can control this with the same option?

Just remove it altogether, as above?
(In reply to Bob Clary [:bc:] from comment #19)
> (In reply to Mark Côté ( :mcote ) from comment #17)
> > Comment on attachment 731835 [details] [diff] [review]
> > logging
> > 
> > Review of attachment 731835 [details] [diff] [review]:
> 
> > 
> > I'm not crazy about leaving commented-out code in there--it has a tendency
> > to bitrot or be checked in accidentally. Maybe we can control this with an
> > option?
> 
> It really isn't that useful. It was mostly educational for me to see which
> loggers were emitting statements. I mostly left it in for you in case you
> wanted to see the multiprocess behavior. I'd rather just remove it
> altogether really.

Fine by me.

> > ::: multiprocessinghandlers.py
> > @@ +10,5 @@
> > > +
> > > +    lock = multiprocessing.Lock()
> > > +
> > > +    def emit(self, record):
> > > +
> > 
> > Remove blank line (same below).
> > 
> > @@ +15,5 @@
> > > +        self.lock.acquire()
> > > +        try:
> > > +            logging.StreamHandler.emit(self, record)
> > > +        except Exception, e:
> > > +            pass
> > 
> > Why are exceptions being completely ignored here (and below)?
> > 
> 
> http://docs.python.org/2/library/logging.html#logging.Handler.handleError
> I was following the advice that logging exceptions be ignored but really I
> think I should remove the try: and let the logging system decide how to
> handle it as described in that link.

Ok.

> > ::: worker.py
> > @@ +515,5 @@
> > > +        #                    '|%(filename)s|%(funcName)s'
> > > +        #                    '|%(lineno)d|%(name)s'
> > > +        #                    '|%(process)d|%(processName)s'
> > > +        #                    '|%(thread)d|%(threadName)s'
> > > +        #                    '|%(message)s')
> > 
> > As with autophone.py, maybe we can control this with the same option?
> 
> just remove it altogether as in above?

Yeah I think so.
I moved this to apply after the logging patch so we could use the new logging here.
Attachment #731828 - Attachment is obsolete: true
Attachment #733903 - Flags: review?(mcote)
Attached patch logging, v2Splinter Review
FYI, hopefully this covers all of your review comments.  Feel free to do a new review if you want. ;-)
Attachment #731835 - Attachment is obsolete: true
Attachment #733905 - Flags: feedback?(mcote)
(In reply to Bob Clary [:bc:] from comment #22)
> Created attachment 733905 [details] [diff] [review]
> logging

There is a problem with this patch. When the day rolls over, the worker processes continue to write to the original autophone.log file, which has been renamed to autophone.log.YYYY-MM-DD (their open file descriptors still point at the renamed file). Their private logs roll over correctly. Trying to figure this out.
Attached patch fix hang in worker main_loop (obsolete) — Splinter Review
This patch contains mostly a bunch of logging statements. The meat is in main_loop, where we recover a phone if it has_error() and sleep for a bit if all commands and jobs are processed. Without the sleep we get into a busy loop that starves the CPU and prevents us from getting anything from the multiprocessing queue. The "main_loop: Commands and jobs processed" log message can be pretty verbose in the log, so we might want to take that out, but it's debug-only so it really shouldn't be a problem in production runs.
Attachment #734670 - Flags: review?(mcote)
When failing to get the symbols file, the IOError exception ioerror does not always contain the nested ioerror.strerror.strerror attribute. I'm not really sure what's up with the ftp error handling, but this patch converts the exception to a string and checks for the 550 in that.
Attachment #734677 - Flags: review?(mcote)
Comment on attachment 733903 [details] [diff] [review]
catch sqlite3.OperationalErrors in jobs.py

Review of attachment 733903 [details] [diff] [review]:
-----------------------------------------------------------------

A few comments below.  The r- is specifically for the code duplication, which will be a pain to maintain.

::: jobs.py
@@ +44,5 @@
> +                                'connection errors.')
> +                logger.exception('Attempt %d failed to connect to jobs '
> +                                 'database. Waiting for %d seconds.' %
> +                                 (attempt, self.SQL_RETRY_DELAY))
> +                time.sleep(self.SQL_RETRY_DELAY)

A few comments about these blocks:

Waiting a full minute seems a bit excessive, no?

I'm thinking an exception should be raised if it can't get a connection after a certain number of retries.  You'd probably have to shut down everything anyway if one or more processes got this stuck.

This code is repeated several times.  I'm sure you could refactor it, possibly into a decorator.

Finally, why have this wrapper in both _conn() *and* the functions that call it?
Attachment #733903 - Flags: review?(mcote) → review-
Comment on attachment 733905 [details] [diff] [review]
logging, v2

Just renaming this so I can try an interdiff.
Attachment #733905 - Attachment description: logging → logging, v2
(In reply to Mark Côté ( :mcote ) from comment #28)
> Comment on attachment 733903 [details] [diff] [review]
> catch sqlite3.OperationalErrors in jobs.py
> 
> Review of attachment 733903 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> A few comments below.  The r- is specifically for the code duplication,
> which will be a pain to maintain.
> 
> ::: jobs.py
> @@ +44,5 @@
> > +                                'connection errors.')
> > +                logger.exception('Attempt %d failed to connect to jobs '
> > +                                 'database. Waiting for %d seconds.' %
> > +                                 (attempt, self.SQL_RETRY_DELAY))
> > +                time.sleep(self.SQL_RETRY_DELAY)
> 
> A few comments about these blocks:
> 
> Waiting a full minute seems a bit excessive, no?
> 

Not really, I think. With 3G network emulation it takes several minutes to install a build. I'm not entirely clear on how and where the sqlite db gets locked. One minute seemed safe: not too short, not too long...

> I'm thinking an exception should be raised if it can't get a connection
> after a certain number of retries.  You'd probably have to shut down
> everything anyway if one or more processes got this stuck.


(In reply to Mark Côté ( :mcote ) from comment #12)
> Comment on attachment 731828 [details] [diff] [review]
> catch sqlite3.OperationalError's in jobs.py
...
> will result in that job being skipped altogether.  I think these should be
> in a retry loop, and I'd even say they should be retried forever, probably
> with an email notification if it takes more than X attempts, since in that
> case something is really wrong (frozen process locking the db or something).

> 
> This code is repeated several times.  I'm sure you could refactor it,
> possibly into a decorator.
> 

Not sure what you mean about a decorator.

> Finally, why have this wrapper in both _conn() *and* the functions that call
> it?

Because I wanted to distinguish the different cases.
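
For reference, the kind of decorator Mark is suggesting might look like this; a hypothetical sketch, with the Jobs method invented for illustration:

    import functools
    import sqlite3
    import time

    def retry_on_locked(retries=5, delay=60):
        """Retry a jobs-db call while sqlite raises OperationalError
        (e.g. 'database is locked'); re-raise after the final attempt."""
        def wrap(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(1, retries + 1):
                    try:
                        return func(*args, **kwargs)
                    except sqlite3.OperationalError:
                        if attempt == retries:
                            raise
                        time.sleep(delay)
            return wrapper
        return wrap

    class Jobs(object):
        @retry_on_locked()
        def get_next_job(self):
            pass  # the single copy of the sqlite access lives here
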
Comment on attachment 733905 [details] [diff] [review]
logging, v2

Review of attachment 733905 [details] [diff] [review]:
-----------------------------------------------------------------

::: autophone.py
@@ +22,4 @@
>  from manifestparser import TestManifest
>  from mozdevice.devicemanager import NetworkTools
>  from pulsebuildmonitor import start_pulse_monitor
> +from multiprocessinghandlers import MultiprocessingStreamHandler, MultiprocessingTimedRotatingFileHandler

Heh I didn't mean up this far--I meant with the other "from" imports of files in the same directory, e.g. mailer and worker.
Attachment #733905 - Flags: feedback?(mcote) → feedback+
Comment on attachment 734670 [details] [diff] [review]
fix hang in worker main_loop

Review of attachment 734670 [details] [diff] [review]:
-----------------------------------------------------------------

r+ing only as a temporary workaround, as discussed below.

::: worker.py
@@ +483,4 @@
>              self.ping()
> +        else:
> +            try:
> +                self.loggerdeco.debug('handle_cmd: Unknown request' % request[0])

Missing a %s here.

@@ +484,5 @@
> +        else:
> +            try:
> +                self.loggerdeco.debug('handle_cmd: Unknown request' % request[0])
> +            except:
> +                self.loggerdeco.exception('handle_cmd: Unknown request')

Why the blanket except here?

@@ +519,5 @@
>              if job:
>                  self.handle_job(job)
> +            else:
> +                self.loggerdeco.debug('main_loop: Commands and jobs processed')
> +                time.sleep(10)

As discussed, I don't think this should be necessary; I structured the loop such that it should block only on self.cmd_queue.get(...).  But if you want to add it in just to temporarily fix your system, I can try to figure it out next week.
Attachment #734670 - Flags: review?(mcote) → review+
Comment on attachment 734671 [details] [diff] [review]
fix zipfile.BadZipfile typo in builds.py

Review of attachment 734671 [details] [diff] [review]:
-----------------------------------------------------------------

Nice catch.  Stupid dynamic language.
Attachment #734671 - Flags: review?(mcote) → review+
Comment on attachment 734673 [details] [diff] [review]
fix missing DMError import in s1s2test.py

Review of attachment 734673 [details] [diff] [review]:
-----------------------------------------------------------------

Another good catch.  Not sure why pyflakes isn't showing it to me.
Attachment #734673 - Flags: review?(mcote) → review+
Comment on attachment 734677 [details] [diff] [review]
fix ftp 550 Failed to change directory getting symbols

Review of attachment 734677 [details] [diff] [review]:
-----------------------------------------------------------------

::: builds.py
@@ +280,4 @@
>                  symbols_zipfile.extractall(symbols_path)
>                  symbols_zipfile.close()
>              except IOError, ioerror:
> +                if '550 Failed to change directory' in ioerror.__str__():

I *think* str(ioerror) is the preferred way to do this, as it expressly tries to get a string out of the target object. __str__() is more of an implementation detail, one way to create this string.  If this returns the same thing here, that is.
Attachment #734677 - Flags: review?(mcote) → review+
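
As a sketch of the reviewed check, with fetch_symbols() a hypothetical stand-in for the download step:

    def fetch_symbols():
        raise IOError('550 Failed to change directory')  # simulated ftp error

    try:
        fetch_symbols()
    except IOError as ioerror:
        # str() is the supported way to stringify the exception;
        # __str__() is an implementation detail.
        if '550 Failed to change directory' in str(ioerror):
            pass  # no symbols published for this build
        else:
            raise
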
Attachment #733903 - Attachment is obsolete: true
I changed the loops to a single loop and it seems much more readable to me.
Attachment #734670 - Attachment is obsolete: true
Attachment #735200 - Flags: review?(mcote)
(In reply to Mark Côté ( :mcote ) from comment #36)

> I *think* str(ioerror) is the preferred way to do this, as it expressly
> tries to get a string out of the target object

Right. I'll update locally.
(In reply to Bob Clary [:bc:] from comment #31)
> Created attachment 734847 [details] [diff] [review]
> logging follow up patch to fix rollovers

fwiw, this worked fine overnight with 8 workers. I like it. :-)
Comment on attachment 734847 [details] [diff] [review]
logging follow up patch to fix rollovers

Not quite ready for prime time. I still have a concurrency issue around rollover.
Attachment #734847 - Flags: review?(mcote)
If the log file is deleted by another process during rollover, wait until it is recreated before returning True to indicate the log should be reopened.

I thought of using filelock and possibly other approaches, but I think this is cleanest. I have tested it somewhat but need to do some heavier testing with quicker rollovers to be absolutely sure this is it. You'll probably want me to take out the comments about Why? and Wait?
Attachment #736805 - Flags: review?(mcote)
Comment on attachment 735200 [details] [diff] [review]
fix hang in worker main_loop, v2

Review of attachment 735200 [details] [diff] [review]:
-----------------------------------------------------------------

I like this; in particular, the removal of the complicated condition at the top of the old loop improves readability.  I haven't tested it, but I mentally went through the loop a few times and it looks fine.  Just a couple little tweaks.

::: worker.py
@@ +501,5 @@
> +            except Queue.Empty:
> +                request = None
> +                if self.has_error():
> +                    self.recover_phone()
> +                else:

I think this should be "if not self.has_error()" rather than the "else" in case self.recover_phone() succeeded.

@@ +504,5 @@
> +                    self.recover_phone()
> +                else:
> +                    job = self.jobs.get_next_job()
> +                    if job:
> +                        self.handle_job(job)

I think in this case, we should use "continue" or put the next bit in an "else".  Otherwise, if there is another job queued up but no commands, we will wait for the timeout for no reason.
Attachment #735200 - Flags: review?(mcote) → review+
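
A hypothetical sketch of the loop shape those two tweaks imply; the method and attribute names are taken from the quoted hunks, and the 60-second timeout is an assumption, so this is not the patch itself:

    import Queue  # Python 2 stdlib name

    def main_loop(self):
        request = None
        while True:
            if request:
                self.handle_cmd(request)
                request = None
                continue
            if self.has_error():
                self.recover_phone()
            if not self.has_error():  # recheck rather than 'else': recovery may succeed
                job = self.jobs.get_next_job()
                if job:
                    self.handle_job(job)
                    continue  # look for more jobs before blocking again
            try:
                request = self.cmd_queue.get(timeout=60)
            except Queue.Empty:
                pass
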
Comment on attachment 736805 [details] [diff] [review]
logging follow up patch to fix rollovers, v2

Review of attachment 736805 [details] [diff] [review]:
-----------------------------------------------------------------

This seems almost right, with the change from filesize to device/inode comparison, but I am concerned about the unexpected OSError.  We need to figure that out.

::: multiprocessinghandlers.py
@@ +55,5 @@
> +        """Check if the log file has already been rotated by another
> +        process by comparing the file size of the logfile name to the
> +        file size of the file pointed to by the stream attribute.  If
> +        they differ, the log file named by baseFilename has been
> +        rotated, and the stream now points to the previous version."""

As discussed, the right way to do this is stat(self.baseFilename), fstat(self.stream.fileno()), and seeing if st_ino and st_dev are equal.  If so, self.stream is pointing to self.baseFilename.

@@ +64,5 @@
> +            current_size = self.stream.tell()
> +            if current_size != baseFilename_size:
> +                return True
> +            return False
> +        except OSError, oserror:

Narrow the scope of this try/except so we know exactly where the OSError is coming from (presumably when trying to access self.baseFilename, but let's be sure).

@@ +65,5 @@
> +            if current_size != baseFilename_size:
> +                return True
> +            return False
> +        except OSError, oserror:
> +            if oserror.errno != 2:

Use the "errno" module instead of raw numbers, since they are platform dependent.  In this case it would be errno.ENOENT.

@@ +69,5 @@
> +            if oserror.errno != 2:
> +                raise
> +            # The logfile has been deleted during a rollover by
> +            # another process. Why didn't the multiprocessing.lock
> +            # save me? Wait for the logfile to be recreated?

We need to figure out why this is happening, rather than potentially masking a real problem that may present itself elsewhere later.
Attachment #736805 - Flags: review?(mcote) → review-
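
Pulling Mark's three points together, the check might look like this (a sketch, not the landed code):

    import errno
    import os

    def logfile_rotated(self):
        """Has self.baseFilename been rotated out from under
        self.stream by another process?"""
        try:
            st_base = os.stat(self.baseFilename)
        except OSError as oserror:
            if oserror.errno != errno.ENOENT:
                raise
            return True  # the file was renamed/deleted: reopen it
        st_stream = os.fstat(self.stream.fileno())
        # Same device and inode means the stream still points at baseFilename.
        return (st_base.st_dev != st_stream.st_dev or
                st_base.st_ino != st_stream.st_ino)
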
If you are having trouble replicating the missing-file problem, then you can leave that loop in, but it would be nice to log the problem (with another emit call or something).  I would like to figure it out.
This has been running locally since 4-13 but I haven't seen the rollover issue again. :-(
Attachment #736805 - Attachment is obsolete: true
Attachment #738136 - Flags: review?(mcote)
Comment on attachment 738136 [details] [diff] [review]
logging follow up patch to fix rollovers, v3

Review of attachment 738136 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great!
Attachment #738136 - Flags: review?(mcote) → review+
Blocks: 872857
Blocks: 873081
Product: Testing → Testing Graveyard