Closed Bug 662885 Opened 13 years ago Closed 13 years ago

Make PulseStatus not use threads

Categories

(Release Engineering :: General, defect, P2)

x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

References

Details

Attachments

(5 files, 4 obsolete files)

Threads are suspected as a cause of huge memory usage. Let's see if we can make PulseStatus not use threads.
Attached patch OOP pulse implementation (obsolete) — Splinter Review
Attachment #543941 - Attachment is obsolete: true
Attachment #544015 - Flags: review?
Attachment #544017 - Flags: review?
Attachment #544015 - Flags: review?(rail)
Attachment #544015 - Flags: review?(dustin)
Attachment #544015 - Flags: review?
Comment on attachment 544017 [details] [diff] [review]
set up queuedir instances in master_common.py

I'm not thrilled with this. If you have other ideas, I'm totally open to suggestions.
Attachment #544017 - Flags: review? → review?(bhearsum)
Attachment #544018 - Flags: review?(dustin)
Attachment #544018 - Flags: review?(bhearsum)
Attachment #544018 - Flags: review?
Comment on attachment 544018 [details] [diff] [review]
rip out old pulse/log uploading from buildbotcustom

Review of attachment 544018 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #544018 - Flags: review?(bhearsum) → review+
Comment on attachment 544017 [details] [diff] [review]
set up queuedir instances in master_common.py

Review of attachment 544017 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla-tests/scheduler_master.cfg
@@ +6,4 @@
>  import master_common
>  import config
>  import master_localconfig
> +reload(buildbotcustom.misc)

Out of curiosity, why is this getting rearranged?
Attachment #544017 - Flags: review?(bhearsum) → review+
Comment on attachment 544015 [details] [diff] [review]
OOP pulse publisher implementation

Overall it looks good. My only concern here is the fact that the directories are flat, so we may have performance problems for long queues (thousands of jobs).
Attachment #544015 - Flags: review?(rail) → review+
Comment on attachment 544018 [details] [diff] [review]
rip out old pulse/log uploading from buildbotcustom

Review of attachment 544018 [details] [diff] [review]:
-----------------------------------------------------------------

::: status/pulse.py
@@ +98,4 @@
>  
>      def pushEvents(self):
> +        """Trigger a push"""
> +        self.delayed_push = reactor.callLater(self.push_delay, self._do_push)

It looks like this used to be protected by a DeferredLock.  Would it make sense here to only set delayed_push if it's not already set, so that multiple pushes aren't queued simultaneously?  Such multiple pushes wouldn't be a problem (since _do_push is atomic), but they wouldn't be properly cancelled by stopService.
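
For illustration, a minimal sketch of the guarded version I have in mind (untested; assumes delayed_push starts out as None):

    def pushEvents(self):
        """Trigger a push, but never queue more than one delayed call."""
        # reactor is already imported by status/pulse.py.  Only schedule a
        # new call if one isn't already pending, so stopService has a single
        # outstanding call it can cancel.
        if self.delayed_push is None or not self.delayed_push.active():
            self.delayed_push = reactor.callLater(self.push_delay, self._do_push)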

::: status/queued_command.py
@@ +33,5 @@
> +    def disownServiceParent(self):
> +        self.master_status.unsubscribe(self)
> +        for w in self.watched:
> +            w.unsubscribe(self)
> +        return base.StatusReceiverMultiService.disownServiceParent(self)

I think we worried about this on another patch - setServiceParent/disownServiceParent are not the best place to do startService/stopService operations.  I think this is modeled on other code in buildbotcustom, and it *works*, so I'm OK with this part of the patch as-is, but I thought I'd mention it.
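
For concreteness, the stopService-based variant would look roughly like this (sketch only, reusing the names from the patch):

    def stopService(self):
        # Unsubscribing here keeps teardown inside the normal service
        # lifecycle rather than in disownServiceParent.
        self.master_status.unsubscribe(self)
        for w in self.watched:
            w.unsubscribe(self)
        return base.StatusReceiverMultiService.stopService(self)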

@@ +36,5 @@
> +            w.unsubscribe(self)
> +        return base.StatusReceiverMultiService.disownServiceParent(self)
> +
> +    def stopService(self):
> +        base.StatusReceiverMultiService.stopService(self)

This is redundant if it's just calling the parent method.
Attachment #544018 - Flags: review?(dustin) → review+
Comment on attachment 544015 [details] [diff] [review]
OOP pulse publisher implementation

Review of attachment 544015 [details] [diff] [review]:
-----------------------------------------------------------------

Rather than introduce a new top-level module, I think it'd be better to put queuedir under a package - lib/python/buildtools, perhaps?  Also, double-check that setup.py catches queuedir.py, and add the pyinotify dependency.

::: buildbot-helpers/command_runner.py
@@ +68,5 @@
> +        """
> +        Runs the given job
> +        """
> +        log.info("Running %s", job.cmd)
> +        output = self.q.getlog(job.item_id)

This is already passed to the Job constructor by loop(), and is only used in the exception case.  Maybe job.start should do the exception logging (since it has the log), and this method should just handle self.active and requeueing on an OSError?

@@ +83,5 @@
> +        """
> +        Kill jobs that have run too long
> +        """
> +        now = time.time()
> +        for sigtime, level, p in self.to_kill:

It looks like this is redundant to Job.cancel, and self.to_kill is never populated.

::: buildbot-helpers/pulse_publisher.py
@@ +133,5 @@
> +            self._last_connection = None
> +            self._last_activity = None
> +
> +    def loop(self):
> +        """

This looks a lot like (but not the same as) the corresponding method in command_runner.py.  Is there room for sharing code?

::: lib/python/queuedir.py
@@ +89,5 @@
> +
> +    ###
> +    # For producers
> +    ###
> +    def add(self, data, retries=0):

It doesn't look like `retries` is ever used in this method?  Were you intending to count down from `retries` instead of up from 0?

@@ +130,5 @@
> +                os.rename(os.path.join(self.new_dir, item), dst_name)
> +                os.utime(dst_name, None)
> +                return item, open(dst_name, 'rb')
> +            except OSError:
> +                pass

This is probably worth logging?

@@ +173,5 @@
> +        """
> +        return open(self.getlogname(item_id), "a+")
> +
> +    def log(self, item_id, msg):
> +        self.getlog(item_id).write(msg)

Should this add a newline?

@@ +187,5 @@
> +            return
> +        now = time.time()
> +        for t, item_id, max_retries in self.to_requeue[:]:
> +            if now > t:
> +                self.requeue(item_id, max_retries=max_retries)

Do you need to pass max_retries here?  It was already checked when the item was added to self.to_requeue.

@@ +220,5 @@
> +            return
> +
> +        if delay:
> +            self.to_requeue.append( (time.time() + delay, item_id, max_retries) )
> +            self.to_requeue.sort()

how large do you expect this queue to get?  Would it make sense to use a heapq?  That's probably premature optimization, but I thought I'd ask.

@@ +238,5 @@
> +            dst_name = os.path.join(self.dead_dir, "%s.log" % item_id)
> +            os.rename(self.getlogname(item_id), dst_name)
> +
> +    if pyinotify:
> +        def wait(self, timeout=None):

The other code calls wait() unconditionally, but it's not even defined if pyinotify isn't found.  Is there a reason to keep this conditional here, or can you just assume pyinotify is installed?
Attachment #544015 - Flags: review?(dustin) → review-
Attachment #544015 - Attachment is obsolete: true
Attachment #552769 - Flags: review?(dustin)
First off, thanks for the very thorough review!

(In reply to Dustin J. Mitchell [:dustin] from comment #10)
> Comment on attachment 544015 [details] [diff] [review]
> OOP pulse publisher implementation
> 
> Review of attachment 544015 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Rather than introduce a new top-level module, I think it'd be better to put
> queuedir under a package - lib/python/buildtools, perhaps?  Also,
> double-check that setup.py catches queuedir.py, and add the pyinotify
> dependency.

I moved it into lib/python/buildtools/queuedir.py (which does get picked up by setup.py). The pyinotify dependency is now optional.

> ::: buildbot-helpers/command_runner.py
> @@ +68,5 @@
> > +        """
> > +        Runs the given job
> > +        """
> > +        log.info("Running %s", job.cmd)
> > +        output = self.q.getlog(job.item_id)
> 
> This is already passed to the Job constructor by loop(), and is only used in
> the exception case.  Maybe job.start should do the exception logging (since
> it has the log), and this method should just handle self.active and
> requeueing on an OSError?

I think the exception handling is cleaner here, but I've reused the existing log file object.

> @@ +83,5 @@
> > +        """
> > +        Kill jobs that have run too long
> > +        """
> > +        now = time.time()
> > +        for sigtime, level, p in self.to_kill:
> 
> It looks like this is redundant to Job.cancel, and self.to_kill is never
> populated.

Yep, removed this.

> 
> ::: buildbot-helpers/pulse_publisher.py
> @@ +133,5 @@
> > +            self._last_connection = None
> > +            self._last_activity = None
> > +
> > +    def loop(self):
> > +        """
> 
> This looks a lot like (but not the same as) the corresponding method in
> command_runner.py.  Is there room for sharing code?
>
> ::: lib/python/queuedir.py
> @@ +89,5 @@
> > +
> > +    ###
> > +    # For producers
> > +    ###
> > +    def add(self, data, retries=0):
> 
> It doesn't look like `retries` is ever used in this method?  Were you
> intending to count down from `retries` instead of up from 0?

yeah, originally. I've removed this parameter here.

> @@ +130,5 @@
> > +                os.rename(os.path.join(self.new_dir, item), dst_name)
> > +                os.utime(dst_name, None)
> > +                return item, open(dst_name, 'rb')
> > +            except OSError:
> > +                pass
> 
> This is probably worth logging?

This would be fairly common with multiple consumers, so I didn't think it was worth logging.
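
For context, the consumer side just claims an item with an atomic rename, so losing the race to another consumer is expected. Roughly (sketch only; the new_dir/cur_dir names are assumed from the patch):

    for item in sorted(os.listdir(self.new_dir)):
        dst_name = os.path.join(self.cur_dir, item)
        try:
            # os.rename is atomic on the same filesystem, so exactly one
            # consumer wins; the losers get OSError and move on.
            os.rename(os.path.join(self.new_dir, item), dst_name)
            os.utime(dst_name, None)
            return item, open(dst_name, 'rb')
        except OSError:
            continue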

> @@ +173,5 @@
> > +        """
> > +        return open(self.getlogname(item_id), "a+")
> > +
> > +    def log(self, item_id, msg):
> > +        self.getlog(item_id).write(msg)
> 
> Should this add a newline?

Yup.

> @@ +187,5 @@
> > +            return
> > +        now = time.time()
> > +        for t, item_id, max_retries in self.to_requeue[:]:
> > +            if now > t:
> > +                self.requeue(item_id, max_retries=max_retries)
> 
> Do you need to pass max_retries here?  It was already checked when the item
> was added to self.to_requeue.

removed

> @@ +220,5 @@
> > +            return
> > +
> > +        if delay:
> > +            self.to_requeue.append( (time.time() + delay, item_id, max_retries) )
> > +            self.to_requeue.sort()
> 
> how large do you expect this queue to get?  Would it make sense to use a
> heapq?  That's probably premature optimization, but I thought I'd ask.

Hopefully not very large! In the case where pulse or stage goes down, we'd be accumulating hundreds of items.
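
That said, if it ever did grow, switching would be simple enough. A rough heapq version (untested; delay_requeue/process_requeues are just illustrative names):

    import heapq

    def delay_requeue(self, item_id, delay, max_retries):
        # heappush keeps the soonest deadline at index 0, so there's no
        # need to re-sort the whole list on every insert.
        heapq.heappush(self.to_requeue,
                       (time.time() + delay, item_id, max_retries))

    def process_requeues(self):
        now = time.time()
        while self.to_requeue and self.to_requeue[0][0] <= now:
            _, item_id, _ = heapq.heappop(self.to_requeue)
            self.requeue(item_id)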

> @@ +238,5 @@
> > +            dst_name = os.path.join(self.dead_dir, "%s.log" % item_id)
> > +            os.rename(self.getlogname(item_id), dst_name)
> > +
> > +    if pyinotify:
> > +        def wait(self, timeout=None):
> 
> The other code calls wait() unconditionally, but it's not even defined if
> pyinotify isn't found.  Is there a reason to keep this conditional here, or
> can you just assume pyinotify is installed?

I added a wait() implementation that doesn't rely on pyinotify.
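
It's just a simple polling loop, roughly (sketch; the landed version may differ slightly):

    def wait(self, timeout=None):
        """Block until new/ has items or the timeout expires."""
        start = time.time()
        while not os.listdir(self.new_dir):
            if timeout is not None and time.time() - start >= timeout:
                return
            time.sleep(1)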
attached wrong patch, sorry!
Attachment #552769 - Attachment is obsolete: true
Attachment #552769 - Flags: review?(dustin)
Attachment #552771 - Flags: review?(dustin)
Comment on attachment 552771 [details] [diff] [review]
OOP pulse publisher implementation

Looks good!
Attachment #552771 - Flags: review?(dustin) → review+
These manifests set up a /builds/buildbot/queue virtualenv and run two daemons out of it: command_runner and pulse_publisher.

They also add nagios checks to verify that the two daemons are running.
Attachment #553466 - Flags: review?(bear)
Comment on attachment 553466 [details] [diff] [review]
puppet manifests for queue processors

r+ but with a "wow, big patch - let's test it well" caveat
Attachment #553466 - Flags: review?(bear) → review+
Attachment #552771 - Flags: checked-in+
Attachment #553466 - Flags: checked-in+
Attachment #544018 - Flags: checked-in+
Attachment #544017 - Attachment is obsolete: true
Attachment #559215 - Flags: review?(bhearsum)
Attachment #559215 - Flags: review?(bhearsum) → review+
Attachment #559215 - Flags: checked-in+
Flags: needs-reconfig?
Merged to production and reconfigured.
Flags: needs-reconfig? → needs-reconfig+
Attachment #559484 - Flags: review?(bear)
Flags: needs-reconfig+
This landed in production on the 9th.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Attachment #559484 - Flags: review?(bear) → review+
Attachment #559484 - Flags: checked-in+
Comment on attachment 552771 [details] [diff] [review]
OOP pulse publisher implementation

>diff --git a/buildbot-helpers/check_queuedir.py b/buildbot-helpers/check_queuedir.py
>+def check_queuedir(d, options):
...
>+    # Check 'dead'
>+    num_dead = len(os.listdir(os.path.join(d, 'dead')))
>+    if num_dead > 0:
>+        status = CRITICAL
>+        msgs.append("%i dead items" % num_dead)

Because there are two files per failure (the command and the .log file), you could say the nagios alerts using this overcount by a factor of two.
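
One way to avoid the double counting would be to skip the .log files when counting, e.g. (untested):

    dead_items = [f for f in os.listdir(os.path.join(d, 'dead'))
                  if not f.endswith('.log')]
    num_dead = len(dead_items)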
Product: mozilla.org → Release Engineering