Closed Bug 662885 Opened 13 years ago Closed 13 years ago

Make PulseStatus not use threads

Categories

(Release Engineering :: General, defect, P2)

x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

References

Details

Attachments

(5 files, 4 obsolete files)

Threads are suspected as a cause of huge memory usage. Let's see if we can make PulseStatus not use threads.
Attached patch OOP pulse implementation (obsolete) — Splinter Review
Attachment #543941 - Attachment is obsolete: true
Attachment #544015 - Flags: review?
Attachment #544017 - Flags: review?
Attachment #544015 - Flags: review?(rail)
Attachment #544015 - Flags: review?(dustin)
Attachment #544015 - Flags: review?
Comment on attachment 544017 [details] [diff] [review]
set up queuedir instances in master_common.py

I'm not thrilled with this. If you have other ideas, I'm totally open to suggestions.
Attachment #544017 - Flags: review? → review?(bhearsum)
Attachment #544018 - Flags: review?(dustin)
Attachment #544018 - Flags: review?(bhearsum)
Attachment #544018 - Flags: review?
Comment on attachment 544018 [details] [diff] [review]
rip out old pulse/log uploading from buildbotcustom

Review of attachment 544018 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #544018 - Flags: review?(bhearsum) → review+
Comment on attachment 544017 [details] [diff] [review]
set up queuedir instances in master_common.py

Review of attachment 544017 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla-tests/scheduler_master.cfg
@@ +6,4 @@
>  import master_common
>  import config
>  import master_localconfig
> +reload(buildbotcustom.misc)

Out of curiosity, why is this getting rearranged?
Attachment #544017 - Flags: review?(bhearsum) → review+
Comment on attachment 544015 [details] [diff] [review]
OOP pulse publisher implementation

Overall it looks good. My only concern here is the fact that the directories are flat, so we may have performance problems for long queues (thousands of jobs).
Attachment #544015 - Flags: review?(rail) → review+
Comment on attachment 544018 [details] [diff] [review]
rip out old pulse/log uploading from buildbotcustom

Review of attachment 544018 [details] [diff] [review]:
-----------------------------------------------------------------

::: status/pulse.py
@@ +98,4 @@
>  
>      def pushEvents(self):
> +        """Trigger a push"""
> +        self.delayed_push = reactor.callLater(self.push_delay, self._do_push)

It looks like this used to be protected by a DeferredLock.  Would it make sense here to only set delayed_push if it's not already set, so that multiple pushes aren't queued simultaneously?  Such multiple pushes wouldn't be a problem (since _do_push is atomic), but they wouldn't be properly cancelled by stopService.
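
For illustration, a minimal sketch of the guarded version I have in mind (untested; assumes delayed_push starts out as None):

    def pushEvents(self):
        """Trigger a push, but never queue more than one delayed call."""
        # reactor is already imported by status/pulse.py.  Only schedule a
        # new call if one isn't already pending, so stopService has a single
        # outstanding call it can cancel.
        if self.delayed_push is None or not self.delayed_push.active():
            self.delayed_push = reactor.callLater(self.push_delay, self._do_push)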

::: status/queued_command.py
@@ +33,5 @@
> +    def disownServiceParent(self):
> +        self.master_status.unsubscribe(self)
> +        for w in self.watched:
> +            w.unsubscribe(self)
> +        return base.StatusReceiverMultiService.disownServiceParent(self)

I think we worried about this on another patch - setServiceParent/disownServiceParent are not the best place to do startService/stopService operations.  I think this is modeled on other code in buildbotcustom, and it *works*, so I'm OK with this part of the patch as-is, but I thought I'd mention it.
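
For concreteness, the stopService-based variant would look roughly like this (sketch only, reusing the names from the patch):

    def stopService(self):
        # Unsubscribing here keeps teardown inside the normal service
        # lifecycle rather than in disownServiceParent.
        self.master_status.unsubscribe(self)
        for w in self.watched:
            w.unsubscribe(self)
        return base.StatusReceiverMultiService.stopService(self)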

@@ +36,5 @@
> +            w.unsubscribe(self)
> +        return base.StatusReceiverMultiService.disownServiceParent(self)
> +
> +    def stopService(self):
> +        base.StatusReceiverMultiService.stopService(self)

This is redundant if it's just calling the parent method.
Attachment #544018 - Flags: review?(dustin) → review+
Comment on attachment 544015 [details] [diff] [review]
OOP pulse publisher implementation

Review of attachment 544015 [details] [diff] [review]:
-----------------------------------------------------------------

Rather than introduce a new top-level module, I think it'd be better to put queuedir under a package - lib/python/buildtools, perhaps?  Also, double-check that setup.py catches queuedir.py, and add the pyinotify dependency.

::: buildbot-helpers/command_runner.py
@@ +68,5 @@
> +        """
> +        Runs the given job
> +        """
> +        log.info("Running %s", job.cmd)
> +        output = self.q.getlog(job.item_id)

This is already passed to the Job constructor by loop(), and is only used in the exception case.  Maybe job.start should do the exception logging (since it has the log), and this method should just handle self.active and requeueing on an OSError?

@@ +83,5 @@
> +        """
> +        Kill jobs that have run too long
> +        """
> +        now = time.time()
> +        for sigtime, level, p in self.to_kill:

It looks like this is redundant to Job.cancel, and self.to_kill is never populated.

::: buildbot-helpers/pulse_publisher.py
@@ +133,5 @@
> +            self._last_connection = None
> +            self._last_activity = None
> +
> +    def loop(self):
> +        """

This looks a lot like (but not the same as) the corresponding method in command_runner.py.  Is there room for sharing code?

::: lib/python/queuedir.py
@@ +89,5 @@
> +
> +    ###
> +    # For producers
> +    ###
> +    def add(self, data, retries=0):

It doesn't look like `retries` is ever used in this method?  Were you intending to count down from `retries` instead of up from 0?

@@ +130,5 @@
> +                os.rename(os.path.join(self.new_dir, item), dst_name)
> +                os.utime(dst_name, None)
> +                return item, open(dst_name, 'rb')
> +            except OSError:
> +                pass

This is probably worth logging?

@@ +173,5 @@
> +        """
> +        return open(self.getlogname(item_id), "a+")
> +
> +    def log(self, item_id, msg):
> +        self.getlog(item_id).write(msg)

Should this add a newline?

@@ +187,5 @@
> +            return
> +        now = time.time()
> +        for t, item_id, max_retries in self.to_requeue[:]:
> +            if now > t:
> +                self.requeue(item_id, max_retries=max_retries)

Do you need to pass max_retries here?  It was already checked when the item was added to self.to_requeue.

@@ +220,5 @@
> +            return
> +
> +        if delay:
> +            self.to_requeue.append( (time.time() + delay, item_id, max_retries) )
> +            self.to_requeue.sort()

how large do you expect this queue to get?  Would it make sense to use a heapq?  That's probably premature optimization, but I thought I'd ask.

@@ +238,5 @@
> +            dst_name = os.path.join(self.dead_dir, "%s.log" % item_id)
> +            os.rename(self.getlogname(item_id), dst_name)
> +
> +    if pyinotify:
> +        def wait(self, timeout=None):

The other code calls wait() unconditionally, but it's not even defined if pyinotify isn't found.  Is there a reason to keep this conditional here, or can you just assume pyinotify is installed?
Attachment #544015 - Flags: review?(dustin) → review-
Attachment #544015 - Attachment is obsolete: true
Attachment #552769 - Flags: review?(dustin)
First off, thanks for the very thorough review!

(In reply to Dustin J. Mitchell [:dustin] from comment #10)
> Comment on attachment 544015 [details] [diff] [review]
> OOP pulse publisher implementation
> 
> Review of attachment 544015 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Rather than introduce a new top-level module, I think it'd be better to put
> queuedir under a package - lib/python/buildtools, perhaps?  Also,
> double-check that setup.py catches queuedir.py, and add the pyinotify
> dependency.

I moved it into lib/python/buildtools/queuedir.py (which does get picked up by setup.py). The pyinotify dependency is now optional.

> ::: buildbot-helpers/command_runner.py
> @@ +68,5 @@
> > +        """
> > +        Runs the given job
> > +        """
> > +        log.info("Running %s", job.cmd)
> > +        output = self.q.getlog(job.item_id)
> 
> This is already passed to the Job constructor by loop(), and is only used in
> the exception case.  Maybe job.start should do the exception logging (since
> it has the log), and this method should just handle self.active and
> requeueing on an OSError?

I think the exception handling is cleaner here, but I've reused the existing log file object.

> @@ +83,5 @@
> > +        """
> > +        Kill jobs that have run too long
> > +        """
> > +        now = time.time()
> > +        for sigtime, level, p in self.to_kill:
> 
> It looks like this is redundant to Job.cancel, and self.to_kill is never
> populated.

Yep, removed this.

> 
> ::: buildbot-helpers/pulse_publisher.py
> @@ +133,5 @@
> > +            self._last_connection = None
> > +            self._last_activity = None
> > +
> > +    def loop(self):
> > +        """
> 
> This looks a lot like (but not the same as) the corresponding method in
> command_runner.py.  Is there room for sharing code?
>
> ::: lib/python/queuedir.py
> @@ +89,5 @@
> > +
> > +    ###
> > +    # For producers
> > +    ###
> > +    def add(self, data, retries=0):
> 
> It doesn't look like `retries` is ever used in this method?  Were you
> intending to count down from `retries` instead of up from 0?

yeah, originally. I've removed this parameter here.

> @@ +130,5 @@
> > +                os.rename(os.path.join(self.new_dir, item), dst_name)
> > +                os.utime(dst_name, None)
> > +                return item, open(dst_name, 'rb')
> > +            except OSError:
> > +                pass
> 
> This is probably worth logging?

This would be fairly common with multiple consumers, so I didn't think it was worth logging.
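
For context, the consumer side just claims an item with an atomic rename, so losing the race to another consumer is expected. Roughly (sketch only; the new_dir/cur_dir names are assumed from the patch):

    for item in sorted(os.listdir(self.new_dir)):
        dst_name = os.path.join(self.cur_dir, item)
        try:
            # os.rename is atomic on the same filesystem, so exactly one
            # consumer wins; the losers get OSError and move on.
            os.rename(os.path.join(self.new_dir, item), dst_name)
            os.utime(dst_name, None)
            return item, open(dst_name, 'rb')
        except OSError:
            continue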

> @@ +173,5 @@
> > +        """
> > +        return open(self.getlogname(item_id), "a+")
> > +
> > +    def log(self, item_id, msg):
> > +        self.getlog(item_id).write(msg)
> 
> Should this add a newline?

Yup.

> @@ +187,5 @@
> > +            return
> > +        now = time.time()
> > +        for t, item_id, max_retries in self.to_requeue[:]:
> > +            if now > t:
> > +                self.requeue(item_id, max_retries=max_retries)
> 
> Do you need to pass max_retries here?  It was already checked when the item
> was added to self.to_requeue.

removed

> @@ +220,5 @@
> > +            return
> > +
> > +        if delay:
> > +            self.to_requeue.append( (time.time() + delay, item_id, max_retries) )
> > +            self.to_requeue.sort()
> 
> how large do you expect this queue to get?  Would it make sense to use a
> heapq?  That's probably premature optimization, but I thought I'd ask.

Hopefully not very large! In the case where pulse or stage goes down, we'd be accumulating hundreds of items.
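
That said, if it ever did grow, switching would be simple enough. A rough heapq version (untested; delay_requeue/process_requeues are just illustrative names):

    import heapq

    def delay_requeue(self, item_id, delay, max_retries):
        # heappush keeps the soonest deadline at index 0, so there's no
        # need to re-sort the whole list on every insert.
        heapq.heappush(self.to_requeue,
                       (time.time() + delay, item_id, max_retries))

    def process_requeues(self):
        now = time.time()
        while self.to_requeue and self.to_requeue[0][0] <= now:
            _, item_id, _ = heapq.heappop(self.to_requeue)
            self.requeue(item_id)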

> @@ +238,5 @@
> > +            dst_name = os.path.join(self.dead_dir, "%s.log" % item_id)
> > +            os.rename(self.getlogname(item_id), dst_name)
> > +
> > +    if pyinotify:
> > +        def wait(self, timeout=None):
> 
> The other code calls wait() unconditionally, but it's not even defined if
> pyinotify isn't found.  Is there a reason to keep this conditional here, or
> can you just assume pyinotify is installed?

I added a wait() implementation that doesn't rely on pyinotify.
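
It's just a simple polling loop, roughly (sketch; the landed version may differ slightly):

    def wait(self, timeout=None):
        """Block until new/ has items or the timeout expires."""
        start = time.time()
        while not os.listdir(self.new_dir):
            if timeout is not None and time.time() - start >= timeout:
                return
            time.sleep(1)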
attached wrong patch, sorry!
Attachment #552769 - Attachment is obsolete: true
Attachment #552769 - Flags: review?(dustin)
Attachment #552771 - Flags: review?(dustin)
Comment on attachment 552771 [details] [diff] [review]
OOP pulse publisher implementation

Looks good!
Attachment #552771 - Flags: review?(dustin) → review+
These manifests set up a /builds/buildbot/queue virtualenv and run two daemons out of it: command_runner and pulse_publisher.

They also add nagios checks to verify that the two daemons are running.
Attachment #553466 - Flags: review?(bear)
Comment on attachment 553466 [details] [diff] [review]
puppet manifests for queue processors

r+ but with a "wow, big patch - let's test it well" caveat
Attachment #553466 - Flags: review?(bear) → review+
Attachment #552771 - Flags: checked-in+
Attachment #553466 - Flags: checked-in+
Attachment #544018 - Flags: checked-in+
Attachment #544017 - Attachment is obsolete: true
Attachment #559215 - Flags: review?(bhearsum)
Attachment #559215 - Flags: review?(bhearsum) → review+
Attachment #559215 - Flags: checked-in+
Flags: needs-reconfig?
Merged to production and reconfigured.
Flags: needs-reconfig? → needs-reconfig+
Attachment #559484 - Flags: review?(bear)
Flags: needs-reconfig+
This landed in production on the 9th.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Attachment #559484 - Flags: review?(bear) → review+
Attachment #559484 - Flags: checked-in+
Comment on attachment 552771 [details] [diff] [review]
OOP pulse publisher implementation

>diff --git a/buildbot-helpers/check_queuedir.py b/buildbot-helpers/check_queuedir.py
>+def check_queuedir(d, options):
...
>+    # Check 'dead'
>+    num_dead = len(os.listdir(os.path.join(d, 'dead')))
>+    if num_dead > 0:
>+        status = CRITICAL
>+        msgs.append("%i dead items" % num_dead)

Because there are two files per failure (the command and the .log file), you could say the nagios alerts using this overcount by a factor of two.
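
One way to avoid the double counting would be to skip the .log files when counting, e.g. (untested):

    dead_items = [f for f in os.listdir(os.path.join(d, 'dead'))
                  if not f.endswith('.log')]
    num_dead = len(dead_items)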
Product: mozilla.org → Release Engineering