Closed Bug 832893 Opened 8 years ago Closed 8 years ago

change jobqueue.pl to spawn worker processes to deliver bugmail to avoid memory leaks

Categories

(Bugzilla :: Email Notifications, defect)

4.0.9
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Bugzilla 4.4

People

(Reporter: glob, Assigned: glob)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 2 obsolete files)

recently we've seen several occurrences of jobqueue.pl consuming large amounts of memory, pushing the webheads into swap.

for example, most recently had jobqueues consuming 1.4g each.


given the push frequency, we're not sure if this is a new issue, or it's a recent change.
As we start caching more and more objects I can imaging it having more stuff cached in memory the longer it runs.  I know we do a bunch of cleanup stuff on each request when handling web pages, but jobqueue isn't running inside a web request, so it's possible that cleanup is never getting done...  and this is almost surely an upstream issue.
Assignee: nobody → email-notifications
Component: General → Email Notifications
Product: bugzilla.mozilla.org → Bugzilla
QA Contact: default-qa
Version: Production → 4.0.9
the cleanup happens via bug 822547.

i want to investigate this on bmo first, due to the number of extensions.  i will shift upstream if investigations point to it being an upstream issue.
Assignee: email-notifications → glob
Component: Email Notifications → General
Product: Bugzilla → bugzilla.mozilla.org
QA Contact: default-qa
Version: 4.0.9 → Production
aside from higher baseline memory usage, i don't see bmo's jobqueue.pl consuming memory at a higher rate than trunk's (in fact it was less).

it's hard to tell if it's a memory leak, or just fragmentation due to all the small allocations perl does.  the cached stuff is cleared for each run, and i haven't been able to create a circular reference in the cache.

i think it may be easier to change how jobqueue.pl works..

while (1) {
  if (there are pending jobs in the queue) {
    system "jobqueue.pl -f process_pending_jobs"
  }
  sleep 5
}
Component: General → Bugzilla-General
Product: bugzilla.mozilla.org → Bugzilla
QA Contact: default-qa
Version: Production → 4.0.9
Attached patch patch v1 (obsolete) — Splinter Review
Attachment #704786 - Flags: review?(LpSolit)
Summary: jobqueue.pl looks like it's leaking memory → change jobqueue.pl to spawn worker processes to deliver bugmail to avoid memory leaks
Comment on attachment 704786 [details] [diff] [review]
patch v1

I'm really not the right person to review this patch. I cannot test jobqueue nor do I use mod_perl.


>=== modified file 'Bugzilla/JobQueue.pm'

> use parent qw(TheSchwartz);
>+use fields qw(_pidfile);

From what http://perldoc.perl.org/base.html says:

"Unless you are using the fields pragma, consider this module discouraged in favor of the lighter-weight parent."

To me, this means that if we use the fields pragma, you cannot use parent.


That's the single comment I can make. :)
Attachment #704786 - Flags: review?(LpSolit) → review?(justdave)
(In reply to Frédéric Buclin from comment #5)
> I cannot test jobqueue nor do I use mod_perl.

fwiw this isn't related to mod_perl.

i test jobqueue by setting use_mailer_queue to 1, triggering bugmail, and then running |./jobqueue.pl -d -f start|

to test this change, i wrote a simple script which calls MessageToMTA() hundreds of times.

> > use parent qw(TheSchwartz);
> >+use fields qw(_pidfile);
> 
> From what http://perldoc.perl.org/base.html says:
> 
> "Unless you are using the fields pragma, consider this module discouraged in
> favor of the lighter-weight parent."
> 
> To me, this means that if we use the fields pragma, you cannot use parent.

nice catch (although it appears to work fine with "parent").
i've changed "use parent" to "use base" locally, will hold off on uploading a new patch until the current one is reviewed.
Comment on attachment 704786 [details] [diff] [review]
patch v1

Review of attachment 704786 [details] [diff] [review]:
-----------------------------------------------------------------

Nice. Looks fine and works for my testing. Except for what LpSolit mentioned, r=dkl
Attachment #704786 - Flags: review+
Flags: approval?
during some more testing, i realized this patch has problems with stopping.

if a sub-process worker is running, and you ask the daemon to stop, it waits until the worker has finished then stops.  the stop message should cascade from the parent to the worker processes.
Flags: approval?
Attachment #704786 - Flags: review?(justdave)
Attached patch patch v2 (obsolete) — Splinter Review
this revision allows the parent jobqueue.pl to catch an INT signal and ask the worker jobqueue.pl to stop.

i also changed how the 'once' command works, in order to make it match the 'onepass'.  with this patch, 'once' and 'onepass' creates a pidfile and runs in the background by default.  anyone calling 'once' will have to include '-f' to avoid deamonisation.
Attachment #704786 - Attachment is obsolete: true
Attachment #710780 - Flags: review?(dkl)
Comment on attachment 710780 [details] [diff] [review]
patch v2

Review of attachment 710780 [details] [diff] [review]:
-----------------------------------------------------------------

Only issue I found is that when using /etc/init.d/bugzilla-queue to start/top the daemon on RHEL/CentOS systems, it automatically stores the pidfile as /var/run/bugzilla-queue.pid instead of <cgi_path>/data/bugzilla-queue.pid. Which is fine as long as the service runs as 'root'. If you do not run as 'root' and instead run as 'dkl' or other, the script dies when trying to create /var/run/bugzilla-queue.worker.pid as permission denied. Even though /var/run/bugzilla-queue.pid is owned by dkl.root as configured by the SystemV init scripts. Running jobqueue.pl -d -f start locally from the cgi_path dir works fine as my data dir is owned by dkl.dkl. Maybe not an issue we need to worry about but one none the less since there will not be just a single jobqueue process being started by /etc/init.d/bugzilla-queue when your patch is applied.

r=dkl
Attachment #710780 - Flags: review?(dkl) → review+
(In reply to David Lawrence [:dkl] from comment #10)
> Only issue I found is that when using /etc/init.d/bugzilla-queue to
> start/top the daemon on RHEL/CentOS systems, it automatically stores the
> pidfile as /var/run/bugzilla-queue.pid instead of
> <cgi_path>/data/bugzilla-queue.pid. Which is fine as long as the service
> runs as 'root'. If you do not run as 'root' and instead run as 'dkl' or
> other, the script dies when trying to create
> /var/run/bugzilla-queue.worker.pid as permission denied. Even though
> /var/run/bugzilla-queue.pid is owned by dkl.root as configured by the
> SystemV init scripts.

this shouldn't be a problem because init.d scripts should always be started by root, not by a normal user:

  print "$initscript installed.",
        " To start the daemon, do \"$dest_file start\" as root.\n";

it's odd that our init.d script creates and chowns the pidfile, instead of leaving that managed by the daemon::generic.
Flags: approval?
(In reply to Byron Jones ‹:glob› from comment #11)
> (In reply to David Lawrence [:dkl] from comment #10)
> > Only issue I found is that when using /etc/init.d/bugzilla-queue to
> > start/top the daemon on RHEL/CentOS systems, it automatically stores the
> > pidfile as /var/run/bugzilla-queue.pid instead of
> > <cgi_path>/data/bugzilla-queue.pid. Which is fine as long as the service
> > runs as 'root'. If you do not run as 'root' and instead run as 'dkl' or
> > other, the script dies when trying to create
> > /var/run/bugzilla-queue.worker.pid as permission denied. Even though
> > /var/run/bugzilla-queue.pid is owned by dkl.root as configured by the
> > SystemV init scripts.
> 
> this shouldn't be a problem because init.d scripts should always be started
> by root, not by a normal user:
> 
>   print "$initscript installed.",
>         " To start the daemon, do \"$dest_file start\" as root.\n";

Yes you need to start/stop the service as root. But as the initiscript is written, the PIDFILE is created as root before the daemon is even started. Then the PIDFILE is passed to the daemon as a parameter for it to use. The daemon itself is ran as a custom user specified by the /etc/sysconfig/bugzilla-queue config file. In my case I had the daemon user set to 'dkl' which cannot create an additional worker PIDFILE in the same dir as the parent jobqueue PIDFILE. Setting the USER=root in /etc/sysconfig/bugzilla-queue of course fixes all this but if we always have to be 'root' the configuration option should be removed or documented.

dkl
in that case perhaps it would be better to always create the worker's pid file in the data directory.
Comment on attachment 710780 [details] [diff] [review]
patch v2

Changing to r- so that the script can be updated to store the worker pid in the data/ dir instead of /var/run when executed from an initscript.

dkl
Attachment #710780 - Flags: review+ → review-
Attached patch patch v3Splinter Review
always store the worker's pidfile in bugzilla's data/ directory.
Attachment #710780 - Attachment is obsolete: true
Attachment #711355 - Flags: review?(dkl)
Comment on attachment 711355 [details] [diff] [review]
patch v3

r=dkl
Attachment #711355 - Flags: review?(dkl) → review+
We should add a comment to /etc/sysconfig/bugzilla-queue (generated by Bugzilla::JobQueue::Runner::gd_can_install) and also to the relnotes stating that the value for USER in the bugzilla-queue config file will need to have write permissions to the data/ directory of the Bugzilla installation.

dkl
Keywords: relnote
Status: NEW → ASSIGNED
Flags: approval?
Flags: approval4.4+
Flags: approval+
Target Milestone: --- → Bugzilla 4.4
Component: Bugzilla-General → Email Notifications
Committing to: bzr+ssh://bjones%40mozilla.com@bzr.mozilla.org/bugzilla/trunk/
modified jobqueue.pl
modified Bugzilla/JobQueue.pm
modified Bugzilla/JobQueue/Runner.pm
modified t/011pod.t
Committed revision 8573.

Committing to: bzr+ssh://bjones%40mozilla.com@bzr.mozilla.org/bugzilla/4.4/
modified jobqueue.pl
modified Bugzilla/JobQueue.pm
modified Bugzilla/JobQueue/Runner.pm
Committed revision 8514.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
(In reply to David Lawrence [:dkl] from comment #17)
> We should add a comment to /etc/sysconfig/bugzilla-queue (generated by
> Bugzilla::JobQueue::Runner::gd_can_install) and also to the relnotes stating
> that the value for USER in the bugzilla-queue config file will need to have
> write permissions to the data/ directory of the Bugzilla installation.

No need, it already does. So there is no need to relnote anything as there is nothing new here.
Keywords: relnote
OS: Mac OS X → All
Hardware: x86 → All
Which version of Daemon::Generic did you have when you encountered this problem? If your problem is the same as https://rt.cpan.org/Public/Bug/Display.html?id=36413, then we should require Daemon::Generic 0.71.
Err... I pasted the wrong link. I meant this one:

https://rt.cpan.org/Public/Bug/Display.html?id=58287
(In reply to Frédéric Buclin from comment #21)
> https://rt.cpan.org/Public/Bug/Display.html?id=58287

sorry, but i don't think that issue (hanging on logout/STDIN not detached correctly) is related to jobqueue's memory consumption.
Blocks: 1006288
You need to log in before you can comment on or make changes to this bug.