Closed Bug 608734 Opened 11 years ago Closed 11 years ago

Change collector to default to using local storage

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: laura, Assigned: lars)

Details

As discussed at the architecture meeting:
- Collector should write to local storage
- The hbase(re)submitter.py should pick up jobs and:
  - try to write them to hbase
  - if that fails, fail over to writing them to NFS, if it's configured

I think this should all be configurable, so basically collectors have a config setting for what to use for primary storage, and what to use for fallback storage.  This way, if we have a local storage problem, changing back to writing directly to HBase should be as easy as changing configs.
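
A minimal sketch of the primary/fallback idea described above, not the actual Socorro code. The names `DictStorage`, `StorageError`, `save_with_fallback`, and the `primary_storage`/`fallback_storage` config keys are all hypothetical, and the in-memory backends only stand in for local disk, NFS, or HBase:

```python
class StorageError(Exception):
    """Raised by a storage backend that cannot accept a crash."""


class DictStorage:
    """Hypothetical in-memory stand-in for a local FS, NFS, or HBase backend."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.crashes = {}

    def save(self, crash_id, payload):
        if not self.available:
            raise StorageError(f"{self.name} is unavailable")
        self.crashes[crash_id] = payload


def save_with_fallback(crash_id, payload, primary, fallback=None):
    """Try the primary store; on failure use the fallback if one is configured.

    Returns the name of the backend that accepted the crash.
    """
    try:
        primary.save(crash_id, payload)
        return primary.name
    except StorageError:
        if fallback is None:
            raise  # no fallback configured, so the failure is reported as-is
        fallback.save(crash_id, payload)
        return fallback.name


# Hypothetical config: which backend is primary and which is fallback.
config = {"primary_storage": "local", "fallback_storage": "hbase"}
backends = {
    "local": DictStorage("local", available=False),  # simulate a local disk problem
    "hbase": DictStorage("hbase"),
}
used = save_with_fallback(
    "crash-0001",
    {"product": "Firefox"},
    backends[config["primary_storage"]],
    backends.get(config["fallback_storage"]),
)
```

With this shape, switching a collector back to writing directly to HBase would only mean changing the two config values; the calling code stays the same.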

Submitters should also be configurable, since we will likely run a couple.  Configurable parts should be:
- filesystem to read crashes from (local or NFS)
- destination to write crashes to (HBase or NFS)

If I have missed anything from our discussion, please add it to this bug.
Does this include switching to the mod-wsgi collector or is that a later release?  For the moment, I'm assuming that's a later release.
This is proving to be more problematic than it originally looked.  There are two problems:

1) It appears that crash volume can overwhelm the local storage system.  We've noticed that when HBase is down and the collectors are using fallback storage, Nagios will complain that it cannot submit.  From the perspective of a client (Nagios in this case) there is no functional difference between HBase storage and fallback storage.  So if a submission fails during HBase downtime, it is the fallback (local file system) storage that is failing.  I've verified this experimentally.

2) The file system storage scheme contains a subtle delay in when crashes are detected as new.  Crashes are stored in 5-minute buckets out at the leaf end of the date storage directories.  Because it's still being filled, the current 5-minute bucket is not considered when iterating over the list of "new" crashes.  This delay will derail prompt processing of priority jobs.  The original system did not suffer from this problem because the monitor had access to the NFS share and could access the crash directly without having to go through the 5-minute directory.
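
To make the delay in (2) concrete, here is a small sketch of the bucketing behaviour as described above. It is an illustration only: `bucket_start` and `visible_new_crashes` are hypothetical helpers, not JsonDumpStorage functions, and the dates are arbitrary sample values.

```python
from datetime import datetime, timedelta

BUCKET_MINUTES = 5  # bucket width described above


def bucket_start(ts):
    """Round a timestamp down to the start of its 5-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % BUCKET_MINUTES,
                      second=0, microsecond=0)


def visible_new_crashes(crashes, now):
    """Return ids of crashes whose bucket is closed (not the one still filling)."""
    current = bucket_start(now)
    return [cid for cid, arrived in crashes if bucket_start(arrived) < current]


crashes = [
    ("a", datetime(2010, 11, 1, 12, 1)),  # falls in the 12:00 bucket
    ("b", datetime(2010, 11, 1, 12, 6)),  # falls in the 12:05 bucket
]
now = datetime(2010, 11, 1, 12, 7)

# At 12:07 the 12:05 bucket is still being filled, so "b" is skipped.
seen = visible_new_crashes(crashes, now)

# At 12:10 the 12:05 bucket has closed, so "b" finally shows up as new.
later = visible_new_crashes(crashes, now + timedelta(minutes=3))
```

Under this model a crash that lands at the very start of a bucket waits almost the full five minutes before it can be picked up, which is the problem for priority jobs.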

Not sure how to resolve these problems... thinking about it.
I've got an implementation of this system using the classic JsonDumpStorage system (the same system used for fallback storage in the current collector).  JsonDumpStorage has a lot of unnecessary overhead that is not used by this collector.  While the system that I created works, I would suggest that if it shows signs that it isn't fast enough, we code a variant of the JsonDumpStorage without the unneeded features.  

The new system does not include a fallback mechanism. I have held off on implementing one until I've resolved another requested feature: the configurable storage mechanisms.

This is a great feature request that could be used throughout Socorro to allow any storage mechanism to be used for any stage of processing.  I immediately see that this would be a way to facilitate working with other projects that might not want to adopt hbase.

However, there is a complication: to really get this right, there would need to be some changes to the way that configuration is done.  It would be most convenient to have configuration namespaces in the manner that 'argparse' from Python 2.7 has them.   I could dive in and implement them, but since we're in the midst of changes to how we're doing configuration, I figure we should coordinate the efforts.
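
As a rough illustration of what namespaced configuration could look like, the sketch below groups options per component using `argparse.Namespace` objects. This is one reading of the idea, not the Socorro configuration design: `argparse` itself returns a single flat namespace, so the nesting and every option name here are assumptions made for the example.

```python
import argparse

parser = argparse.ArgumentParser(description="hypothetical collector/submitter options")
parser.add_argument("--collector-primary", default="local")
parser.add_argument("--collector-fallback", default="hbase")
parser.add_argument("--submitter-source", default="local")
parser.add_argument("--submitter-destination", default="hbase")

# parse_args returns a flat argparse.Namespace; the explicit list stands in for sys.argv.
flat = parser.parse_args(["--submitter-destination", "nfs"])

# Regroup the flat options so each component sees only its own settings.
config = argparse.Namespace(
    collector=argparse.Namespace(primary=flat.collector_primary,
                                 fallback=flat.collector_fallback),
    submitter=argparse.Namespace(source=flat.submitter_source,
                                 destination=flat.submitter_destination),
)
```

Each storage role then reads its settings from its own namespace (for example `config.submitter.destination`), so running a second submitter with different settings does not require new option names in the code that uses them.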

To recap, I've got a solution implemented to resolve this bug, but it isn't as flexible as it could be.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro