Closed Bug 542624 Opened 14 years ago Closed 14 years ago

Need Socorro integration with Hadoop crash report storage to be able to retrieve crash reports

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dre, Assigned: dre)

Details

The flip side to bug 538206 storing crash reports in Hadoop is that Socorro needs to be able to pull them back out.
Once this is fully operational, we'll be able to drastically cut back storage of crash reports on the NFS server.

The Python code attached to bug 538206 contains two retrieval APIs, one that takes an OOID and another that takes a date range.

Please let me know what other requirements you might need in order to mimic the current way you interact with the NFS server to pull crash reports out.
Currently the processed crash files are read by Apache and served up to:
* web browsers
* our PHP app via curl requests

Our app does not read these files via a filesystem.

The curl request is:
http://crash-stats.mozilla.com/dumps/<UUID>.jsonz
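
For reference, a minimal sketch of how a client might build that URL for a given crash. The helper name `processed_crash_url` is hypothetical; only the base URL and the `.jsonz` suffix come from the request above.

```python
from urllib.parse import quote

# Base URL taken from the curl request quoted above.
BASE_URL = "http://crash-stats.mozilla.com/dumps"


def processed_crash_url(uuid: str) -> str:
    """Return the URL of the processed crash file for one crash UUID.

    Hypothetical helper for illustration; it only formats the URL and
    does not perform the HTTP request.
    """
    # Percent-encode the UUID so an unexpected character cannot alter the path.
    return f"{BASE_URL}/{quote(uuid, safe='')}.jsonz"
```

Passing the result to curl or any HTTP client would then reproduce the request the PHP app makes today.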

IT has control over this URL and how it's served up.
Processed crash files are a separate piece of this.

In order to get the processed crash files, the processor must retrieve the .json and .dump files.  That is the piece that needs to be extended to be able to retrieve them via a call to Hadoop instead of off of the NFS mount.

That said, it is a very important point that the output of the processor needs a place to live as well.  That would mean that the processor would probably want to retrieve a crash report, process it, then update the crash report in Hadoop with the .jsonz data.
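
A rough sketch of that retrieve, process, update cycle, assuming a simple get/put interface. `InMemoryCrashStore` and `process_crash` are hypothetical stand-ins and not the real Hadoop client; the column names are the ones listed later in this bug.

```python
import json


class InMemoryCrashStore:
    """Hypothetical stand-in for the Hadoop-backed store (not the real API)."""

    def __init__(self):
        self._rows = {}

    def put(self, ooid, column, value):
        # Store one column value under the crash's OOID.
        self._rows.setdefault(ooid, {})[column] = value

    def get(self, ooid, column):
        # Raises KeyError if the crash or the column is missing.
        return self._rows[ooid][column]


def process_crash(store, ooid, processor):
    """Fetch the raw inputs, run the processor, and store its output."""
    meta = json.loads(store.get(ooid, "meta_data:json"))
    dump = store.get(ooid, "raw_data:dump")
    processed = processor(meta, dump)
    # Write the processor output back alongside the original crash data.
    store.put(ooid, "processed_data:json", json.dumps(processed))
    return processed
```

The point of the sketch is only that the processor both reads from and writes to the same store, so nothing has to land on the NFS mount in between.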
Yes, sorry I wasn't clear. Comment #1 is a dependency for moving completely off a traditional filesystem.
Blocks: 543759
-> pythonic middleware
Version: 1.x → 1.7
Delivery of Socorro 1.7 is accommodating retrieval of all three critical pieces of data:
meta_data:json (the original submitted json)
raw_data:dump (the minidump binary)
processed_data:json (the "jsonz" file)

We need to make sure that the loose ends are tied up, however. Maybe some blocking or depends bugs on this one?

The PHP app has no need to retrieve the original meta_data:json or the raw_data:dump, correct?  Currently, code is written in the monitor and processors to retrieve that data.

We need to make sure that calls to http://crash-stats.mozilla.com/dumps/<UUID>.jsonz are updated with the 1.7 push to retrieve the processed_data:json string from HBase.  This is currently possible by using the Python layer to invoke the method get_processed_json_as_string(ooid).
Assignee: nobody → deinspanjer
Target Milestone: Future → 1.7
(In reply to comment #5)

> The PHP app has no need to retrieve the original meta_data:json or the
> raw_data:dump, correct?  Currently, code is written in the monitor and
> processors to retrieve that data.

If a user is authorized, they can access the original metadata and raw_data files via Apache. We should continue to support this.
Okay, then we need to ensure the pythonic middleware supports calls to get_json_meta_as_string(ooid) and get_dump(ooid)
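
As a sketch of what the middleware side could look like, assuming a client object that exposes the three method names mentioned in this bug and that each takes only an OOID. The function `fetch_crash_part` and the part labels are hypothetical.

```python
# Maps a hypothetical part label to the client method name used in this bug.
PART_TO_METHOD = {
    "meta": "get_json_meta_as_string",
    "raw": "get_dump",
    "processed": "get_processed_json_as_string",
}


def fetch_crash_part(client, ooid, part):
    """Return one piece of crash data for the given OOID.

    `client` is assumed to provide the three methods named above; the
    real hbaseClient.py signatures may differ.
    """
    try:
        method_name = PART_TO_METHOD[part]
    except KeyError:
        raise ValueError(f"unknown crash part: {part!r}") from None
    # Look the method up by name so a client missing it fails only when used.
    return getattr(client, method_name)(ooid)
```

Keeping the mapping in one table would also make a later rename of the methods a one-line change per entry.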

Please note that the more I type these method names the more I think we should have better names for them in hbaseClient.py. :)

Lars, how hairy would a cleanup refactoring be? Could we determine official names for these important methods and check in the code by tomorrow's code freeze?
A cleanup refactoring would not be difficult, and I highly encourage it.

BTW, earlier this afternoon, I checked in routines for the pythonic middleware that fetch things from HBase:

.../201005/crash/meta/by/uuid/4c0a21db-aeb8-4f5b-8fea-36a402100512
.../201005/crash/raw_crash/by/uuid/4c0a21db-aeb8-4f5b-8fea-36a402100512
.../201005/crash/processed/by/uuid/4c0a21db-aeb8-4f5b-8fea-36a402100512

I just haven't documented it yet.
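
Until it is documented, here is a small sketch of how a caller might assemble those paths. The helper `crash_path` is hypothetical, the elided prefix is left as `...` exactly as written above, and the `201005` segment is copied verbatim from the examples.

```python
# The three resource kinds that appear in the example paths above.
KINDS = ("meta", "raw_crash", "processed")


def crash_path(kind, uuid, prefix="..."):
    """Build a middleware path for one crash, following the examples above.

    `prefix` defaults to the elided "..." from the examples; the real
    host and leading path are not given in this bug.
    """
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    return f"{prefix}/201005/crash/{kind}/by/uuid/{uuid}"
```

For example, `crash_path("processed", uuid)` yields the third path listed above when given the same UUID.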
Blocks: 565692
No longer blocks: 565692
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro