This morning, we experienced a problem with Zeus, which led to processor and monitor failure. Ashish restarted everything but things did not come back to life.

Investigation of logs showed the following items:

- processor07 was wedged and not doing anything. The last log lines were:
  2012-08-24 04:47:49,078 INFO - MainThread - registering with 'processors' table
  2012-08-24 04:47:49,079 DEBUG - MainThread - looking for a dead processor for host sp-processor07.phx1.mozilla.com
  2012-08-24 04:47:49,081 INFO - MainThread - will step in for processor 3168
  2012-08-24 04:47:49,082 DEBUG - MainThread - taking over a dead processor
- Other processors showed no jobs to do, and were sleeping for six seconds each.
- The monitor was running, but the only things showing up in its logs were the cleanup job and the priority job thread. The MainThread was not showing up. Priority jobs were being processed normally. No other jobs were being processed.
- The contents of the processors table were as follows: http://ashish.pastebin.mozilla.org/1774060
- The contents of the jobs table were as follows: http://ashish.pastebin.mozilla.org/1774061

I had Ashish stop all processors and monitors, and run the following SQL:

delete from processors;
delete from jobs;

and then restart the processors and monitor. Everything came back to life normally.

Lars, can you explain what might have happened, and decide whether there is something we might change in the code to avoid similar problems in future?
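For anyone repeating the cleanup step above, here is a minimal sketch of running both deletes in a single transaction, so that a failure partway through leaves neither table half-cleared. It is an illustration only: it uses an in-memory SQLite database as a stand-in for the production database, and the table columns, the sample rows, and the function names `clear_queue_tables` and `make_demo_db` are hypothetical, not taken from the real schema or code. As in the steps above, all processors and monitors should be stopped first.

```python
import sqlite3


def clear_queue_tables(conn):
    """Empty the processors and jobs tables in one transaction."""
    # The connection context manager commits if both statements succeed
    # and rolls back if either one raises.
    with conn:
        conn.execute("DELETE FROM processors")
        conn.execute("DELETE FROM jobs")


def make_demo_db():
    """Build a throwaway database (hypothetical columns, for the demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processors (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, owner INTEGER)")
    conn.execute("INSERT INTO processors (id, name) VALUES (1, 'demo-host')")
    conn.executemany("INSERT INTO jobs (owner) VALUES (?)", [(1,), (1,)])
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    clear_queue_tables(demo)
    for table in ("processors", "jobs"):
        remaining = demo.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        print(table, "rows remaining:", remaining)
```

Wrapping the two statements together is a design choice for the sketch, not something the original steps required; running them one after the other by hand, as was done here, has the same end result when both succeed.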
Excerpt from monitor log during the failure: http://laura.pastebin.mozilla.org/1774089
The monitor is now deprecated, which made this problem go away.