Closed Bug 420809 Opened 13 years ago Closed 13 years ago

Refactor current mutual exclusion method

Categories

(Socorro :: General, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: morgamic, Assigned: lars)

References

Details

Attachments

(1 file)

The current monitor is flawed because it achieves mutual exclusion through handling an SQLException, which could happen for any number of reasons.  The current code assumes that if an SQLException happens it is because of an attempted insert on a duplicate reports.id, and this is the wrong way to have concurrency between multiple instances of the processor.

Use of Gamin was thwarted by NFS and its inability to be polled by a fs monitoring lib like Gamin. An alternative locking mechanism using the database should suffice (using existing db connections) without creating too much additional overhead.
Attached image locking workflow
Priority: -- → P1
Any exclusion done in the database would end being a variant of what is already done.  I would suggest that just fixing the current method would be quickest and most appropriate course of action.

The current system catches the exception SQLError and assumes that it exists because of a duplicated ID in the database.  The class SQLError (actually renamed DBAPIError in more recent versions of SQLAlchemy) broadly covers all server errors.  However, it is possible to determine the actual cause of the error through a component within DBAPIError called "orig".  This is an instance of the original exception raised by the underlying database API.  This exception could be checked to see if it id a "Duplicate Index" or some such error.  While it is true that will, perhaps, tie the code tightly to the underlying database, such will be the case anyway if SQLAlchemy is retired.
There is another problem with this method.  If report insertion is successful and then insertion fails for modules or frames we have a zombie report record and the processor always assumes this record has been previously processed.

This is a problem with not using a transactional way to process dumps, but I think changing our locking mechanism would help this.

In addition to that, the report_id auto increment sequence gets updated when failures happen, so for every successful report the id gets incremented hundreds to thousands of times.
I believe lars has fixed this with his processor refactoring.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Want to keep this open until it's deployed.  Still have to stage it in prod environment.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Update from Lars (from email) -- Lars could you add to this if something is missing:

* processor.py now responds to SIGTERM as well as ^C to terminate.  On responding to one of those signals, it shuts down in an orderly manner by unregistering itself from the database.  After a processor.py terminates, monitor.py will reallocate any pending jobs.
* both programs now have a complete command line interface for changing options
* the config.py has been split into two parts, one for each script: socorro/monitor/config.py and socorro/processor/config.py -- the original socorro/lib/config.py is deprecated and unused.
* the format for the config.py files has changed to a more friendly format for handling parameters as command line options. It should be fairly self explanatory.
* all writing directly to stdout and stderr has been removed in favor using Python Logging.  Both programs will log ERROR and CRITICAL to the stderr while at the same time logging all error/warning/info/debug levels to rotating log files.  This behavior is configurable in the config.py files or the command line (or environment variables).  

The failure of monitor.py to move failed json and dump files on 'stage' is unresolved.  In spite of hours of trying, I cannot get it to fail in the same way on 'khan'.  When I mess around with file permissions to force a failure, I get proper error reporting behavior.  This silent failure stumps me.  I have instrumented monitor.py with a --debug=True option that will let tell the program to verify the file relocation of the .json and .dump files.  This should, at least, catch the problem in the act.
Status: REOPENED → ASSIGNED
Blocks: 422584
Blocks: 422581
Pushed, 4/3 @ 8pm PST.  Nice job Lars!
Status: ASSIGNED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.