Last Comment Bug 832893 - change jobqueue.pl to spawn worker processes to deliver bugmail to avoid memory leaks
: change jobqueue.pl to spawn worker processes to deliver bugmail to avoid memo...
Status: RESOLVED FIXED
:
Product: Bugzilla
Classification: Server Software
Component: Email Notifications (show other bugs)
: 4.0.9
: All All
: -- critical (vote)
: Bugzilla 4.4
Assigned To: Byron Jones ‹:glob› [PTO until 2016-10-10]
: default-qa
Mentors:
Depends on:
Blocks: 1006288
  Show dependency treegraph
 
Reported: 2013-01-21 01:20 PST by Byron Jones ‹:glob› [PTO until 2016-10-10]
Modified: 2014-05-06 01:59 PDT (History)
4 users (show)
LpSolit: approval+
LpSolit: approval4.4+
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
patch v1 (3.44 KB, patch)
2013-01-22 00:52 PST, Byron Jones ‹:glob› [PTO until 2016-10-10]
dkl: review+
Details | Diff | Splinter Review
patch v2 (4.98 KB, patch)
2013-02-06 09:18 PST, Byron Jones ‹:glob› [PTO until 2016-10-10]
dkl: review-
Details | Diff | Splinter Review
patch v3 (5.04 KB, patch)
2013-02-07 08:12 PST, Byron Jones ‹:glob› [PTO until 2016-10-10]
dkl: review+
Details | Diff | Splinter Review

Description Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-01-21 01:20:03 PST
recently we've seen several occurrences of jobqueue.pl consuming large amounts of memory, pushing the webheads into swap.

for example, most recently had jobqueues consuming 1.4g each.


given the push frequency, we're not sure if this is a new issue, or it's a recent change.
Comment 1 Dave Miller [:justdave] (justdave@bugzilla.org) 2013-01-21 02:45:24 PST
As we start caching more and more objects I can imaging it having more stuff cached in memory the longer it runs.  I know we do a bunch of cleanup stuff on each request when handling web pages, but jobqueue isn't running inside a web request, so it's possible that cleanup is never getting done...  and this is almost surely an upstream issue.
Comment 2 Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-01-21 03:45:16 PST
the cleanup happens via bug 822547.

i want to investigate this on bmo first, due to the number of extensions.  i will shift upstream if investigations point to it being an upstream issue.
Comment 3 Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-01-22 00:50:04 PST
aside from higher baseline memory usage, i don't see bmo's jobqueue.pl consuming memory at a higher rate than trunk's (in fact it was less).

it's hard to tell if it's a memory leak, or just fragmentation due to all the small allocations perl does.  the cached stuff is cleared for each run, and i haven't been able to create a circular reference in the cache.

i think it may be easier to change how jobqueue.pl works..

while (1) {
  if (there are pending jobs in the queue) {
    system "jobqueue.pl -f process_pending_jobs"
  }
  sleep 5
}
Comment 4 Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-01-22 00:52:59 PST
Created attachment 704786 [details] [diff] [review]
patch v1
Comment 5 Frédéric Buclin 2013-01-22 10:24:26 PST
Comment on attachment 704786 [details] [diff] [review]
patch v1

I'm really not the right person to review this patch. I cannot test jobqueue nor do I use mod_perl.


>=== modified file 'Bugzilla/JobQueue.pm'

> use parent qw(TheSchwartz);
>+use fields qw(_pidfile);

From what http://perldoc.perl.org/base.html says:

"Unless you are using the fields pragma, consider this module discouraged in favor of the lighter-weight parent."

To me, this means that if we use the fields pragma, you cannot use parent.


That's the single comment I can make. :)
Comment 6 Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-01-22 23:54:25 PST
(In reply to Frédéric Buclin from comment #5)
> I cannot test jobqueue nor do I use mod_perl.

fwiw this isn't related to mod_perl.

i test jobqueue by setting use_mailer_queue to 1, triggering bugmail, and then running |./jobqueue.pl -d -f start|

to test this change, i wrote a simple script which calls MessageToMTA() hundreds of times.

> > use parent qw(TheSchwartz);
> >+use fields qw(_pidfile);
> 
> From what http://perldoc.perl.org/base.html says:
> 
> "Unless you are using the fields pragma, consider this module discouraged in
> favor of the lighter-weight parent."
> 
> To me, this means that if we use the fields pragma, you cannot use parent.

nice catch (although it appears to work fine with "parent").
i've changed "use parent" to "use base" locally, will hold off on uploading a new patch until the current one is reviewed.
Comment 7 David Lawrence [:dkl] 2013-01-30 14:34:54 PST
Comment on attachment 704786 [details] [diff] [review]
patch v1

Review of attachment 704786 [details] [diff] [review]:
-----------------------------------------------------------------

Nice. Looks fine and works for my testing. Except for what LpSolit mentioned, r=dkl
Comment 8 Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-01-30 22:16:55 PST
during some more testing, i realized this patch has problems with stopping.

if a sub-process worker is running, and you ask the daemon to stop, it waits until the worker has finished then stops.  the stop message should cascade from the parent to the worker processes.
Comment 9 Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-02-06 09:18:25 PST
Created attachment 710780 [details] [diff] [review]
patch v2

this revision allows the parent jobqueue.pl to catch an INT signal and ask the worker jobqueue.pl to stop.

i also changed how the 'once' command works, in order to make it match the 'onepass'.  with this patch, 'once' and 'onepass' creates a pidfile and runs in the background by default.  anyone calling 'once' will have to include '-f' to avoid deamonisation.
Comment 10 David Lawrence [:dkl] 2013-02-06 13:33:41 PST
Comment on attachment 710780 [details] [diff] [review]
patch v2

Review of attachment 710780 [details] [diff] [review]:
-----------------------------------------------------------------

Only issue I found is that when using /etc/init.d/bugzilla-queue to start/top the daemon on RHEL/CentOS systems, it automatically stores the pidfile as /var/run/bugzilla-queue.pid instead of <cgi_path>/data/bugzilla-queue.pid. Which is fine as long as the service runs as 'root'. If you do not run as 'root' and instead run as 'dkl' or other, the script dies when trying to create /var/run/bugzilla-queue.worker.pid as permission denied. Even though /var/run/bugzilla-queue.pid is owned by dkl.root as configured by the SystemV init scripts. Running jobqueue.pl -d -f start locally from the cgi_path dir works fine as my data dir is owned by dkl.dkl. Maybe not an issue we need to worry about but one none the less since there will not be just a single jobqueue process being started by /etc/init.d/bugzilla-queue when your patch is applied.

r=dkl
Comment 11 Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-02-06 21:38:55 PST
(In reply to David Lawrence [:dkl] from comment #10)
> Only issue I found is that when using /etc/init.d/bugzilla-queue to
> start/top the daemon on RHEL/CentOS systems, it automatically stores the
> pidfile as /var/run/bugzilla-queue.pid instead of
> <cgi_path>/data/bugzilla-queue.pid. Which is fine as long as the service
> runs as 'root'. If you do not run as 'root' and instead run as 'dkl' or
> other, the script dies when trying to create
> /var/run/bugzilla-queue.worker.pid as permission denied. Even though
> /var/run/bugzilla-queue.pid is owned by dkl.root as configured by the
> SystemV init scripts.

this shouldn't be a problem because init.d scripts should always be started by root, not by a normal user:

  print "$initscript installed.",
        " To start the daemon, do \"$dest_file start\" as root.\n";

it's odd that our init.d script creates and chowns the pidfile, instead of leaving that managed by the daemon::generic.
Comment 12 David Lawrence [:dkl] 2013-02-07 06:56:15 PST
(In reply to Byron Jones ‹:glob› from comment #11)
> (In reply to David Lawrence [:dkl] from comment #10)
> > Only issue I found is that when using /etc/init.d/bugzilla-queue to
> > start/top the daemon on RHEL/CentOS systems, it automatically stores the
> > pidfile as /var/run/bugzilla-queue.pid instead of
> > <cgi_path>/data/bugzilla-queue.pid. Which is fine as long as the service
> > runs as 'root'. If you do not run as 'root' and instead run as 'dkl' or
> > other, the script dies when trying to create
> > /var/run/bugzilla-queue.worker.pid as permission denied. Even though
> > /var/run/bugzilla-queue.pid is owned by dkl.root as configured by the
> > SystemV init scripts.
> 
> this shouldn't be a problem because init.d scripts should always be started
> by root, not by a normal user:
> 
>   print "$initscript installed.",
>         " To start the daemon, do \"$dest_file start\" as root.\n";

Yes you need to start/stop the service as root. But as the initiscript is written, the PIDFILE is created as root before the daemon is even started. Then the PIDFILE is passed to the daemon as a parameter for it to use. The daemon itself is ran as a custom user specified by the /etc/sysconfig/bugzilla-queue config file. In my case I had the daemon user set to 'dkl' which cannot create an additional worker PIDFILE in the same dir as the parent jobqueue PIDFILE. Setting the USER=root in /etc/sysconfig/bugzilla-queue of course fixes all this but if we always have to be 'root' the configuration option should be removed or documented.

dkl
Comment 13 Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-02-07 07:54:31 PST
in that case perhaps it would be better to always create the worker's pid file in the data directory.
Comment 14 David Lawrence [:dkl] 2013-02-07 08:00:33 PST
Comment on attachment 710780 [details] [diff] [review]
patch v2

Changing to r- so that the script can be updated to store the worker pid in the data/ dir instead of /var/run when executed from an initscript.

dkl
Comment 15 Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-02-07 08:12:40 PST
Created attachment 711355 [details] [diff] [review]
patch v3

always store the worker's pidfile in bugzilla's data/ directory.
Comment 16 David Lawrence [:dkl] 2013-02-07 08:27:50 PST
Comment on attachment 711355 [details] [diff] [review]
patch v3

r=dkl
Comment 17 David Lawrence [:dkl] 2013-02-07 09:22:48 PST
We should add a comment to /etc/sysconfig/bugzilla-queue (generated by Bugzilla::JobQueue::Runner::gd_can_install) and also to the relnotes stating that the value for USER in the bugzilla-queue config file will need to have write permissions to the data/ directory of the Bugzilla installation.

dkl
Comment 18 Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-02-14 21:58:28 PST
Committing to: bzr+ssh://bjones%40mozilla.com@bzr.mozilla.org/bugzilla/trunk/
modified jobqueue.pl
modified Bugzilla/JobQueue.pm
modified Bugzilla/JobQueue/Runner.pm
modified t/011pod.t
Committed revision 8573.

Committing to: bzr+ssh://bjones%40mozilla.com@bzr.mozilla.org/bugzilla/4.4/
modified jobqueue.pl
modified Bugzilla/JobQueue.pm
modified Bugzilla/JobQueue/Runner.pm
Committed revision 8514.
Comment 19 Frédéric Buclin 2013-02-17 17:21:04 PST
(In reply to David Lawrence [:dkl] from comment #17)
> We should add a comment to /etc/sysconfig/bugzilla-queue (generated by
> Bugzilla::JobQueue::Runner::gd_can_install) and also to the relnotes stating
> that the value for USER in the bugzilla-queue config file will need to have
> write permissions to the data/ directory of the Bugzilla installation.

No need, it already does. So there is no need to relnote anything as there is nothing new here.
Comment 20 Frédéric Buclin 2013-07-21 18:00:06 PDT
Which version of Daemon::Generic did you have when you encountered this problem? If your problem is the same as https://rt.cpan.org/Public/Bug/Display.html?id=36413, then we should require Daemon::Generic 0.71.
Comment 21 Frédéric Buclin 2013-07-22 01:15:43 PDT
Err... I pasted the wrong link. I meant this one:

https://rt.cpan.org/Public/Bug/Display.html?id=58287
Comment 22 Byron Jones ‹:glob› [PTO until 2016-10-10] 2013-07-22 01:22:15 PDT
(In reply to Frédéric Buclin from comment #21)
> https://rt.cpan.org/Public/Bug/Display.html?id=58287

sorry, but i don't think that issue (hanging on logout/STDIN not detached correctly) is related to jobqueue's memory consumption.

Note You need to log in before you can comment on or make changes to this bug.