
Make PulseStatus not use threads

RESOLVED FIXED

Status

Release Engineering
General
P2
normal
RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: catlee, Assigned: catlee)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(5 attachments, 4 obsolete attachments)

(Assignee)

Description

6 years ago
Threads are suspected as a cause of huge memory usage. Let's see if we can make PulseStatus not use threads.
(Assignee)

Comment 1

6 years ago
Created attachment 543941 [details] [diff] [review]
OOP pulse implementation
(Assignee)

Comment 2

6 years ago
Created attachment 544015 [details] [diff] [review]
OOP pulse publisher implementation
Attachment #543941 - Attachment is obsolete: true
Attachment #544015 - Flags: review?
(Assignee)

Comment 3

6 years ago
Created attachment 544017 [details] [diff] [review]
set up queuedir instances in master_common.py
Attachment #544017 - Flags: review?
(Assignee)

Comment 4

6 years ago
Created attachment 544018 [details] [diff] [review]
rip out old pulse/log uploading from buildbotcustom
Attachment #544018 - Flags: review?
(Assignee)

Updated

6 years ago
Attachment #544015 - Flags: review?(rail)
Attachment #544015 - Flags: review?(dustin)
Attachment #544015 - Flags: review?
(Assignee)

Comment 5

6 years ago
Comment on attachment 544017 [details] [diff] [review]
set up queuedir instances in master_common.py

I'm not thrilled with this. If you have other ideas, I'm totally open to suggestions.
Attachment #544017 - Flags: review? → review?(bhearsum)
(Assignee)

Updated

6 years ago
Attachment #544018 - Flags: review?(dustin)
Attachment #544018 - Flags: review?(bhearsum)
Attachment #544018 - Flags: review?
Comment on attachment 544018 [details] [diff] [review]
rip out old pulse/log uploading from buildbotcustom

Review of attachment 544018 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #544018 - Flags: review?(bhearsum) → review+
Comment on attachment 544017 [details] [diff] [review]
set up queuedir instances in master_common.py

Review of attachment 544017 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla-tests/scheduler_master.cfg
@@ +6,4 @@
>  import master_common
>  import config
>  import master_localconfig
> +reload(buildbotcustom.misc)

Out of curiosity, why is this getting rearranged?
Attachment #544017 - Flags: review?(bhearsum) → review+
Comment on attachment 544015 [details] [diff] [review]
OOP pulse publisher implementation

Overall it looks good. My only concern here is the fact that the directories are flat, so we may have performance problems for long queues (thousands of jobs).
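The flat-directory concern could be illustrated with a small sketch: hashing each item id into a short prefix directory keeps any one directory from growing to thousands of entries. (The `shard_path` helper and hash-prefix scheme are hypothetical illustrations, not part of the patch.)

```python
import hashlib
import os


def shard_path(queue_dir, item_id, width=2):
    """Map an item id to a sharded path under queue_dir.

    Instead of putting every queued job in one flat directory (where
    os.listdir() over thousands of entries gets slow), hash the item id
    and use the first `width` hex digits as a subdirectory name, so each
    directory stays short.
    """
    prefix = hashlib.sha1(item_id.encode()).hexdigest()[:width]
    return os.path.join(queue_dir, prefix, item_id)
```

With `width=2` this spreads items over up to 256 subdirectories, so ten thousand queued jobs average about forty entries per directory.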
Attachment #544015 - Flags: review?(rail) → review+
Comment on attachment 544018 [details] [diff] [review]
rip out old pulse/log uploading from buildbotcustom

Review of attachment 544018 [details] [diff] [review]:
-----------------------------------------------------------------

::: status/pulse.py
@@ +98,4 @@
>  
>      def pushEvents(self):
> +        """Trigger a push"""
> +        self.delayed_push = reactor.callLater(self.push_delay, self._do_push)

It looks like this used to be protected by a DeferredLock.  Would it make sense here to only set delayed_push if it's not already set, so that multiple pushes aren't queued simultaneously?  Such multiple pushes wouldn't be a problem (since _do_push is atomic), but they wouldn't be properly cancelled by stopService.
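The guard being suggested here could be sketched as follows. The class and injected `call_later` callback are hypothetical stand-ins (so the pattern can be shown and exercised without a running Twisted reactor, whose `reactor.callLater` they mimic):

```python
class PushScheduler:
    """Sketch: coalesce pushEvents() calls so at most one delayed push
    is outstanding, and stopService() can cancel it cleanly.

    call_later(delay, fn) stands in for Twisted's reactor.callLater and
    must return a handle with a cancel() method.
    """

    def __init__(self, call_later, push_delay, do_push):
        self.call_later = call_later
        self.push_delay = push_delay
        self.do_push = do_push
        self.delayed_push = None  # the single outstanding delayed call

    def push_events(self):
        # Guard: only schedule if no push is already pending, so
        # repeated events coalesce into a single _do_push call.
        if self.delayed_push is None:
            self.delayed_push = self.call_later(self.push_delay, self._fire)

    def _fire(self):
        # Clear the handle before pushing so a push triggered during
        # do_push() can schedule a fresh delayed call.
        self.delayed_push = None
        self.do_push()

    def stop(self):
        # stopService counterpart: cancel the pending call, if any.
        if self.delayed_push is not None:
            self.delayed_push.cancel()
            self.delayed_push = None
```

This keeps `stopService` simple: there is exactly one handle to cancel, rather than an unknown number of queued duplicates.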

::: status/queued_command.py
@@ +33,5 @@
> +    def disownServiceParent(self):
> +        self.master_status.unsubscribe(self)
> +        for w in self.watched:
> +            w.unsubscribe(self)
> +        return base.StatusReceiverMultiService.disownServiceParent(self)

I think we worried about this on another patch - setServiceParent/disownServiceParent are not the best place to do startService/stopService operations.  I think this is modeled on other code in buildbotcustom, and it *works*, so I'm OK with this part of the patch as-is, but I thought I'd mention it.

@@ +36,5 @@
> +            w.unsubscribe(self)
> +        return base.StatusReceiverMultiService.disownServiceParent(self)
> +
> +    def stopService(self):
> +        base.StatusReceiverMultiService.stopService(self)

This is redundant if it's just calling the parent method.
Attachment #544018 - Flags: review?(dustin) → review+
Comment on attachment 544015 [details] [diff] [review]
OOP pulse publisher implementation

Review of attachment 544015 [details] [diff] [review]:
-----------------------------------------------------------------

Rather than introduce a new top-level module, I think it'd be better to put queuedir under a package - lib/python/buildtools, perhaps?  Also, double-check that setup.py catches queuedir.py, and add the pyinotify dependency.

::: buildbot-helpers/command_runner.py
@@ +68,5 @@
> +        """
> +        Runs the given job
> +        """
> +        log.info("Running %s", job.cmd)
> +        output = self.q.getlog(job.item_id)

This is already passed to the Job constructor by loop(), and is only used in the exception case.  Maybe job.start should do the exception logging (since it has the log), and this method should just handle self.active and requeueing on an OSError?

@@ +83,5 @@
> +        """
> +        Kill jobs that have run too long
> +        """
> +        now = time.time()
> +        for sigtime, level, p in self.to_kill:

It looks like this is redundant to Job.cancel, and self.to_kill is never populated.

::: buildbot-helpers/pulse_publisher.py
@@ +133,5 @@
> +            self._last_connection = None
> +            self._last_activity = None
> +
> +    def loop(self):
> +        """

This looks a lot like (but not the same as) the same method in command_runner.py.  Is there room for sharing code?

::: lib/python/queuedir.py
@@ +89,5 @@
> +
> +    ###
> +    # For producers
> +    ###
> +    def add(self, data, retries=0):

It doesn't look like `retries` is ever used in this method?  Were you intending to count down from `retries` instead of up from 0?

@@ +130,5 @@
> +                os.rename(os.path.join(self.new_dir, item), dst_name)
> +                os.utime(dst_name, None)
> +                return item, open(dst_name, 'rb')
> +            except OSError:
> +                pass

This is probably worth logging?

@@ +173,5 @@
> +        """
> +        return open(self.getlogname(item_id), "a+")
> +
> +    def log(self, item_id, msg):
> +        self.getlog(item_id).write(msg)

Should this add a newline?

@@ +187,5 @@
> +            return
> +        now = time.time()
> +        for t, item_id, max_retries in self.to_requeue[:]:
> +            if now > t:
> +                self.requeue(item_id, max_retries=max_retries)

Do you need to pass max_retries here?  It was already checked when the item was added to self.to_requeue.

@@ +220,5 @@
> +            return
> +
> +        if delay:
> +            self.to_requeue.append( (time.time() + delay, item_id, max_retries) )
> +            self.to_requeue.sort()

How large do you expect this queue to get?  Would it make sense to use a heapq?  That's probably premature optimization, but I thought I'd ask.
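For comparison, the heapq alternative mentioned here might look like this sketch (the `RequeueSchedule` class and its method names are illustrative, not from the patch):

```python
import heapq
import time


class RequeueSchedule:
    """Sketch: keep (when, item_id) pairs in a heap instead of a list
    that is re-sorted on every append.  heappush is O(log n) per insert
    versus O(n log n) for the sort, and the earliest-due item is always
    at heap[0], so the due-check stays cheap.
    """

    def __init__(self):
        self.to_requeue = []

    def schedule(self, item_id, delay, now=None):
        now = time.time() if now is None else now
        heapq.heappush(self.to_requeue, (now + delay, item_id))

    def pop_due(self, now=None):
        """Return the ids of all items whose requeue time has passed,
        earliest first."""
        now = time.time() if now is None else now
        due = []
        while self.to_requeue and self.to_requeue[0][0] <= now:
            due.append(heapq.heappop(self.to_requeue)[1])
        return due
```

For a queue of hundreds of items the difference is negligible, which is why a sorted list is a reasonable choice too.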

@@ +238,5 @@
> +            dst_name = os.path.join(self.dead_dir, "%s.log" % item_id)
> +            os.rename(self.getlogname(item_id), dst_name)
> +
> +    if pyinotify:
> +        def wait(self, timeout=None):

The other code calls wait() unconditionally, but it's not even defined if pyinotify isn't found.  Is there a reason to keep this conditional here, or can you just assume pyinotify is installed?
Attachment #544015 - Flags: review?(dustin) → review-
(Assignee)

Comment 11

6 years ago
Created attachment 552769 [details] [diff] [review]
OOP pulse publisher implementation
Attachment #544015 - Attachment is obsolete: true
Attachment #552769 - Flags: review?(dustin)
(Assignee)

Comment 12

6 years ago
First off, thanks for the very thorough review!

(In reply to Dustin J. Mitchell [:dustin] from comment #10)
> Comment on attachment 544015 [details] [diff] [review]
> OOP pulse publisher implementation
> 
> Review of attachment 544015 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Rather than introduce a new top-level module, I think it'd be better to put
> queuedir under a package - lib/python/buildtools, perhaps?  Also,
> double-check that setup.py catches queuedir.py, and add the pyinotify
> dependency.

I moved it into lib/python/buildtools/queuedir.py (which does get picked up by setup.py). The pyinotify dependency is now optional.

> ::: buildbot-helpers/command_runner.py
> @@ +68,5 @@
> > +        """
> > +        Runs the given job
> > +        """
> > +        log.info("Running %s", job.cmd)
> > +        output = self.q.getlog(job.item_id)
> 
> This is already passed to the Job constructor by loop(), and is only used in
> the exception case.  Maybe job.start should do the exception logging (since
> it has the log), and this method should just handle self.active and
> requeueing on an OSError?

I think the exception handling is cleaner here, but I've reused the existing log file object.

> @@ +83,5 @@
> > +        """
> > +        Kill jobs that have run too long
> > +        """
> > +        now = time.time()
> > +        for sigtime, level, p in self.to_kill:
> 
> It looks like this is redundant to Job.cancel, and self.to_kill is never
> populated.

Yep, removed this.

> 
> ::: buildbot-helpers/pulse_publisher.py
> @@ +133,5 @@
> > +            self._last_connection = None
> > +            self._last_activity = None
> > +
> > +    def loop(self):
> > +        """
> 
> This looks a lot like (but not the same as) the same method in
> command_runner.py.  Is there room for sharing code?
>
> ::: lib/python/queuedir.py
> @@ +89,5 @@
> > +
> > +    ###
> > +    # For producers
> > +    ###
> > +    def add(self, data, retries=0):
> 
> It doesn't look like `retries` is ever used in this method?  Were you
> intending to count down from `retries` instead of up from 0?

Yeah, originally. I've removed this parameter here.

> @@ +130,5 @@
> > +                os.rename(os.path.join(self.new_dir, item), dst_name)
> > +                os.utime(dst_name, None)
> > +                return item, open(dst_name, 'rb')
> > +            except OSError:
> > +                pass
> 
> This is probably worth logging?

This would be fairly common with multiple consumers, so I didn't think it was worth logging.

> @@ +173,5 @@
> > +        """
> > +        return open(self.getlogname(item_id), "a+")
> > +
> > +    def log(self, item_id, msg):
> > +        self.getlog(item_id).write(msg)
> 
> Should this add a newline?

Yup.

> @@ +187,5 @@
> > +            return
> > +        now = time.time()
> > +        for t, item_id, max_retries in self.to_requeue[:]:
> > +            if now > t:
> > +                self.requeue(item_id, max_retries=max_retries)
> 
> Do you need to pass max_retries here?  It was already checked when the item
> was added to self.to_requeue.

Removed.

> @@ +220,5 @@
> > +            return
> > +
> > +        if delay:
> > +            self.to_requeue.append( (time.time() + delay, item_id, max_retries) )
> > +            self.to_requeue.sort()
> 
> how large do you expect this queue to get?  Would it make sense to use a
> heapq?  That's probably premature optimization, but I thought I'd ask.

Hopefully not very large! In the case where pulse or stage goes down, we'd be accumulating hundreds of items.

> @@ +238,5 @@
> > +            dst_name = os.path.join(self.dead_dir, "%s.log" % item_id)
> > +            os.rename(self.getlogname(item_id), dst_name)
> > +
> > +    if pyinotify:
> > +        def wait(self, timeout=None):
> 
> The other code calls wait() unconditionally, but it's not even defined if
> pyinotify isn't found.  Is there a reason to keep this conditional here, or
> can you just assume pyinotify is installed?

I added a wait() implementation that doesn't rely on pyinotify.
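A pyinotify-free wait() along these lines could be a simple poll of the queue's new/ directory. (This is a hypothetical sketch; `wait_for_new_items` is an illustrative name, and the actual implementation in the patch may differ.)

```python
import os
import time


def wait_for_new_items(new_dir, timeout=None, poll_interval=1.0):
    """Sketch of a pyinotify-free wait(): poll the queue's 'new'
    directory until an item appears or the timeout expires.

    Returns True if at least one item is present, False on timeout.
    With timeout=None this blocks until an item arrives.  Polling is
    less responsive than inotify but needs no extra dependency.
    """
    deadline = None if timeout is None else time.time() + timeout
    while True:
        if os.listdir(new_dir):
            return True
        if deadline is not None and time.time() >= deadline:
            return False
        time.sleep(poll_interval)
```

When pyinotify is available it can still be preferred for lower latency, with this polling loop as the fallback.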
(Assignee)

Comment 13

6 years ago
Created attachment 552771 [details] [diff] [review]
OOP pulse publisher implementation

Attached the wrong patch, sorry!
Attachment #552769 - Attachment is obsolete: true
Attachment #552769 - Flags: review?(dustin)
Attachment #552771 - Flags: review?(dustin)
Comment on attachment 552771 [details] [diff] [review]
OOP pulse publisher implementation

Looks good!
Attachment #552771 - Flags: review?(dustin) → review+
(Assignee)

Comment 15

6 years ago
Created attachment 553466 [details] [diff] [review]
puppet manifests for queue processors

These manifests set up a /builds/buildbot/queue virtualenv and run two daemons out of it: command_runner and pulse_publisher.

They also add nagios checks that the two daemons are running.
Attachment #553466 - Flags: review?(bear)

Comment 16

6 years ago
Comment on attachment 553466 [details] [diff] [review]
puppet manifests for queue processors

r+ but with a "wow, big patch - let's test it well" caveat
Attachment #553466 - Flags: review?(bear) → review+
(Assignee)

Updated

6 years ago
Attachment #552771 - Flags: checked-in+
(Assignee)

Updated

6 years ago
Attachment #553466 - Flags: checked-in+
(Assignee)

Updated

6 years ago
Attachment #544018 - Flags: checked-in+
(Assignee)

Comment 17

6 years ago
Created attachment 559215 [details] [diff] [review]
updated configs to allow you to override where the queuedir lives
Attachment #544017 - Attachment is obsolete: true
Attachment #559215 - Flags: review?(bhearsum)
Attachment #559215 - Flags: review?(bhearsum) → review+
(Assignee)

Updated

6 years ago
Attachment #559215 - Flags: checked-in+
(Assignee)

Updated

6 years ago
Flags: needs-reconfig?

Comment 18

6 years ago
Merged to production and reconfigured.
Flags: needs-reconfig? → needs-reconfig+
(Assignee)

Comment 19

6 years ago
Created attachment 559484 [details] [diff] [review]
nagios checks that queues are healthy
Attachment #559484 - Flags: review?(bear)

Updated

6 years ago
Flags: needs-reconfig+
This landed in production on the 9th.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED

Updated

6 years ago
Attachment #559484 - Flags: review?(bear) → review+
(Assignee)

Updated

6 years ago
Attachment #559484 - Flags: checked-in+
Comment on attachment 552771 [details] [diff] [review]
OOP pulse publisher implementation

>diff --git a/buildbot-helpers/check_queuedir.py b/buildbot-helpers/check_queuedir.py
>+def check_queuedir(d, options):
...
>+    # Check 'dead'
>+    num_dead = len(os.listdir(os.path.join(d, 'dead')))
>+    if num_dead > 0:
>+        status = CRITICAL
>+        msgs.append("%i dead items" % num_dead)

Because there are two files per failure (the command and the .log file), the nagios alerts using this count overcount the dead items by a factor of two.
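A count that avoids the factor-of-two inflation could skip each item's companion .log file, as in this sketch (`count_dead_items` is an illustrative name, not from the patch):

```python
import os


def count_dead_items(queue_dir):
    """Sketch: count dead queue items without double-counting.

    Each failed item leaves two files in dead/: the command file and a
    companion "<item_id>.log".  Counting only the non-.log entries gives
    the number of actual failures rather than twice that.
    """
    dead = os.path.join(queue_dir, "dead")
    return sum(1 for name in os.listdir(dead) if not name.endswith(".log"))
```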
Product: mozilla.org → Release Engineering