Closed Bug 59994 Opened 25 years ago Closed 25 years ago

handle CGI timeouts/aborts by aborting database calls

Categories

(Bugzilla :: Bugzilla-General, defect, P3)

Sun
Solaris

Tracking

()

RESOLVED FIXED
Bugzilla 2.12

People

(Reporter: dmosedale, Assigned: endico)

Details

Some (many? all?) webservers abort CGI scripts if they output no text for long periods of time. In the case of Apache, this seems to be 300 seconds (5 minutes). For bugzilla.mozilla.org, this is probably a good value, because we don't want the shadow db tied up for much longer than that (syncs cannot happen while searches are going on). It turns out that the current mechanism for aborting the scripts doesn't cause any in-progress SELECTs to be aborted, which means that truly bad searches can and do continue for very long periods of time (hours even -- thus causing the shadow_db to remain out of sync). Bugzilla needs to have abort-handlers that kill any appropriate running SQL calls (not necessarily all SQL calls, however, as we can't afford to leave the db in an inconsistent state).
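As a rough illustration of the abort-handler idea, here is a minimal sketch in Python (not the Perl that Bugzilla itself uses). It assumes the web server sends the script a catchable signal such as SIGTERM before forcibly killing it. `install_abort_handler`, `kill_query` and `thread_id` are hypothetical names introduced for this sketch; `kill_query` stands in for whatever mechanism actually cancels the running statement.

```python
import signal
import sys


def install_abort_handler(kill_query, thread_id, signum=signal.SIGTERM):
    """Run kill_query(thread_id) when this process receives signum.

    kill_query is a hypothetical callback that cancels the database
    thread serving this script (for example by asking the server to
    kill it). Returns the previously installed handler.
    """
    def on_abort(received_signum, frame):
        try:
            # Cancel the in-progress query before the process goes away.
            kill_query(thread_id)
        finally:
            # Exit even if the cancel attempt itself failed.
            sys.exit(1)

    return signal.signal(signum, on_abort)
```

A handler like this should only be installed around read-only queries, since killing a statement that modifies data could leave the db in an inconsistent state.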
After much groveling through documentation (as well as the source code to Apache), I think I've figured out the strategy for this. It appears that when Apache times out a CGI, it sends the process a SIGTERM and then, three seconds later, a SIGKILL, so this is an opportunity to clean up.

Ideally, we'd use the DBI statement cancel function. Unfortunately, DBD::mysql doesn't implement that, but presumably it could be added. Alternatively, we could find the $dbh->{'thread_id'} and then use system() to fork off a mysqladmin to kill the thread in question, or perhaps even exec() mysqladmin in our own process slot.

Note, however, that "man DBI" tells us that:

    The first thing to say is that signal handling in Perl is currently not safe. There is always a small risk of Perl crashing and/or core dumping when, or after, handling a signal. (The risk was reduced with 5.004_04 but is still present.)

There is also experimental code included with 5.005 (Thread::Signal) which runs signal handlers on a separate thread. However, 5.005 doesn't appear to me to be built multithreaded by default, so this seems overly risky. In any case, we'll need to encourage folks to upgrade to 5.004_04. And crashing in a signal handler will usually be better than running forever and leaving the shadow database out of sync.

Probably the short-term fix is to just use a cronjob script that runs 'mysqladmin processlist' every so often and kills off any long-running threads. In fact, we're likely to want to do this anyway, given the signal-handler lossage. It also wouldn't be difficult to patch MySQL to cause long_queries to be killed when they hit the time limit.

With both of these quick fixes, though, I'm a bit afraid of one really evil SELECT (which might take > 5 mins on its own) significantly slowing down other SELECTs (which might only take 3 mins on their own) and then having the entire batch of threads get killed.
Probably the answer to this (easy to implement in the cronjob) is to never kill more than 1 thread per x minute interval, for some value of x.
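The "at most one kill per interval" rule can be sketched as a small selection function. This is an illustrative Python sketch under stated assumptions, not the script that was actually deployed: `DbThread`, `pick_thread_to_kill`, the field names and the 300-second defaults are all hypothetical, and the thread list is assumed to have been parsed already from something like the 'mysqladmin processlist' output.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class DbThread:
    thread_id: int   # server-side thread id
    command: str     # e.g. "Query" or "Sleep"
    seconds: int     # time spent in the current state
    info: str        # text of the running statement, if any


def pick_thread_to_kill(
    threads: Iterable[DbThread],
    now: float,
    last_kill_at: Optional[float],
    max_query_seconds: int = 300,
    min_kill_interval: int = 300,
) -> Optional[DbThread]:
    """Return the one thread to kill on this run, or None."""
    # Never kill more than one thread per interval.
    if last_kill_at is not None and now - last_kill_at < min_kill_interval:
        return None
    # Only long-running SELECTs are candidates; writes are left alone.
    candidates = [
        t for t in threads
        if t.command == "Query"
        and t.info.lstrip().upper().startswith("SELECT")
        and t.seconds > max_query_seconds
    ]
    if not candidates:
        return None
    # Pick the longest-running query, the most likely "evil" one.
    return max(candidates, key=lambda t: t.seconds)
```

Choosing the longest-running SELECT first gives the remaining queries a chance to finish on their own before the next interval, which is the point of the rate limit.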
Status: NEW → ASSIGNED
I've got a test version of a thread-killer script that kills at most one thread every five minutes. It's running on bugzilla.mozilla.org, but right now it just sends me mail rather than killing anything, so that I can verify that it's really doing the right thing.
The script also sent me about 30 mails today... could that be fixed?
Dan: could you also add a few extra lines to this script to check the health of the database (bug 54818, which I'll mark as depending on this)? It could work as follows, if it can't connect to the database:

    $sleep = 5;
    $pid = `cat /opt/mysql/var/lounge.pid`;  # no chop needed, pidfile doesn't have a NL
    system("kill -TERM $pid");
    sleep($sleep);
    system("sh /etc/init.d/mysql.server start");
    system("echo database restarted | /bin/mailx -s mysql root");
Depends on: 54818
rko: I think the 30 mails were the result of a bug. I'll work on integrating the MySQLd killing code as well.
I've checked in the initial version of this code, which has been running on lounge for a while. I haven't integrated Risto's mysqld checking suggestion yet; that's next.
Mass reassign of my Bugzilla bugs to endico, as I'm switching groups to work on mozilla LDAP integration full time.
Assignee: dmose → endico
Status: ASSIGNED → NEW
Marking this bug fixed and removing the dependency on 54818, which is actually a request for an extra feature in this script.
Status: NEW → RESOLVED
Closed: 25 years ago
No longer depends on: 54818
Resolution: --- → FIXED
In search of accurate queries.... (sorry for the spam)
Target Milestone: --- → Bugzilla 2.12
Moving closed bugs to Bugzilla product
Component: Bugzilla → Bugzilla-General
Product: Webtools → Bugzilla
Version: other → unspecified
QA Contact: matty_is_a_geek → default-qa