Open Bug 1336410 Opened 7 years ago Updated 5 years ago

Fix warning/error "Could not lock PID file" in jobqueue code

Categories

(bugzilla.mozilla.org :: Email Notifications, defect)

Production
defect
Not set
normal

People

(Reporter: dylan, Unassigned)

References

Details

Feb  1 21:29:41 jobqueue2.bugs.scl3.mozilla.com jobqueue.pl[24518]: Sucessfully daemonized
Feb  1 21:29:41 jobqueue2.bugs.scl3.mozilla.com jobqueue.pl[24518]: Could not lock PID file /var/run/bugzilla-queue8.worker.pid: Resource temporarily unavailable at local/lib/perl5/Daemon/Generic.pm line 181.


This happens because we re-execute jobqueue.pl, which calls Bugzilla::JobQueue::Runner->new().

Bugzilla::JobQueue::Runner inherits from Daemon::Generic:

https://metacpan.org/source/MUIR/Daemon-Generic-0.84/lib/Daemon/Generic.pm#L40

So the error is coming from the lock() call at https://metacpan.org/source/MUIR/Daemon-Generic-0.84/lib/Daemon/Generic.pm#L181

Now this is interesting. Why would lock fail? Because the file is already locked -- and it is already locked because we re-execute jobqueue.pl from https://github.com/mozilla-bteam/bmo/blob/master/Bugzilla/JobQueue.pm#L114
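
To illustrate the failure mode, here is a minimal Python sketch (not the Perl code from Daemon::Generic). It assumes the PID file is protected by a non-blocking, flock-style exclusive lock, which is what the "Resource temporarily unavailable" (EAGAIN) text in the log suggests. The helper names try_lock and demo, and the temporary worker.pid path, are made up for illustration. The first open of the file plays the already-running runner; the second plays the re-executed jobqueue.pl.

```python
import errno
import fcntl
import os
import tempfile


def try_lock(path):
    """Open path and attempt a non-blocking exclusive flock.

    Returns (fd, None) on success, or (None, errno) if the lock is refused.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        return None, exc.errno
    return fd, None


def demo():
    # Hypothetical PID file in a temp dir, standing in for the real one.
    with tempfile.TemporaryDirectory() as tmp:
        pidfile = os.path.join(tmp, "worker.pid")
        first_fd, first_err = try_lock(pidfile)    # the original runner
        second_fd, second_err = try_lock(pidfile)  # the re-executed script
        # Clean up whichever descriptors were actually obtained.
        for fd in (first_fd, second_fd):
            if fd is not None:
                os.close(fd)
        return first_err, second_err
```

On Linux, the second attempt should be refused with EAGAIN for as long as the first descriptor stays open, which would match the message logged at Generic.pm line 181 if the lock there behaves the same way.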

This was done to reduce memory fragmentation/leaking in bug 832893. 

I am surprised that this works, reading the code for Daemon::Generic.

Each time jobqueue.pl is run, Daemon::Generic->new is called.
If gd_pidfile is set, it attempts to lock the PID file.
However, it must be that the pidfile isn't set initially, and the first two attempts at locking seem to succeed; only the third, on line 181, fails. It's also not clear whether this is happening in the parent or the child.

It's also not clear that error handling is correct in this instance. When the connection in the subprocess worker dies, is that error communicated to the queue runner? Is this related to entries in ts_error causing jobs to not get processed?

Finally, this method of getting the memory cleared results in *more* memory being used. If we used fork/exec(), each child process would share a lot of memory with its parent, but in this case we have N job queue runners, each with 1 subprocess worker, so we have a higher constant overhead (in exchange for lower usage over time). I think that with the memory leak fixes I've been working on, we can switch back to the traditional model.
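
For comparison, here is a rough Python sketch of what a fork-per-job model could look like. It assumes "the traditional model" means the runner forks a short-lived child for each job, which is my reading and not something confirmed in the code. The function name run_job_in_child is hypothetical. The child starts with a copy-on-write view of the parent's memory, whatever it leaks is reclaimed when it exits, and the exit status gives the parent a simple channel for learning that the job failed.

```python
import os


def run_job_in_child(job):
    """Fork, run job() in the child, and return the child's exit code.

    Returns 0 if job() completed, 1 if it raised, and -1 if the child
    was terminated by a signal.
    """
    pid = os.fork()
    if pid == 0:
        # Child process: shares the parent's pages until it writes to them.
        code = 1
        try:
            job()
            code = 0
        finally:
            # Exit immediately, without running the parent's cleanup handlers.
            os._exit(code)
    # Parent process: wait for the child and translate its status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -1
```

Because the child never re-executes the script, it never constructs a second daemon object and so never tries to take the PID file lock again.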
Summary: Consider reverting bug 832893 → Fix warning/error "Could not lock PID file" in jobqueue code
Assignee: dylan → nobody