Closed Bug 557827 Opened 14 years ago Closed 14 years ago

Socorro data in hbase is in wrong format

Categories

(Socorro :: General, task, P1)

x86
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lars, Assigned: aphadke)

References

Details

(Whiteboard: ETA - 7/16, confirm hbase stability)

A series of snafus resulted in all json data shoved into hbase prior to the production push of 1.6 being in Python repr format instead of json.

This is a perfect opportunity for a map job.  The data just needs to be read back into Python (or Jython), reserialized as JSON, and put back.
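For reference, the per-record conversion described above could look roughly like the following. This is only a sketch, not the job that was eventually run: the function name repr_to_json and the sample value are made up for illustration, and it assumes each stored value is the repr of a plain dict of literals.

```python
import ast
import json


def repr_to_json(raw):
    """Convert the Python-repr string of a dict into a JSON string.

    ast.literal_eval only evaluates literals (dicts, strings, numbers, ...),
    so it does not execute arbitrary code found in the stored value.
    """
    data = ast.literal_eval(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a dict, got %s" % type(data).__name__)
    return json.dumps(data)


# Hypothetical sample value, shortened from the kind of record in this bug.
sample = "{'ProductName': 'Firefox', 'Version': '3.5.7', 'timestamp': 1267255961.509316}"
fixed = repr_to_json(sample)
```

Parsing and reserializing, rather than editing the string, means quoting and escaping are left to the json module.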
Assignee: nobody → deinspanjer
Target Milestone: --- → 1.7
Depends on: 552983
Priority: -- → P1
Just to make sure, did the correction of the data insert get fixed or is it still going in "bad"? If fixed, the date the fix hit production would be handy.

Passing this over to Anurag to develop the MR job.
Assignee: deinspanjer → aphadke
To confirm:
table name = crash_reports
key = ooid (hbase ooid)
metadata: <byte array of the serialized json string>
Yes, that's the right table and column.  The data in 'meta_data:json' isn't really JSON for the records inserted prior to the shipping of Socorro 1.6.  It is instead in Python repr form, which is very similar to JSON.  The two may be so close that the fix is simply changing single quotes to double quotes.  However, that needs to be verified.
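One way to do that verification is to look at the cases where repr and JSON are known to differ: strings that contain an apostrophe, and the literals True, False and None. The sketch below uses invented field values (they are not taken from crash_reports) to show a plain quote swap failing where a parse-and-reserialize round trip succeeds.

```python
import ast
import json

# Hypothetical values chosen to show where the two formats diverge.
tricky = repr({"Comments": "it's broken", "Throttleable": True, "URL": None})

# Naive approach: swap every single quote for a double quote.
naive = tricky.replace("'", '"')
try:
    json.loads(naive)
    naive_ok = True
except ValueError:
    # The apostrophe breaks the string, and True/None are not JSON literals.
    naive_ok = False

# Parsing the repr and reserializing handles all three cases.
roundtrip = json.loads(json.dumps(ast.literal_eval(tricky)))
```

If the real records only ever contain simple strings and numbers, the quote swap may still be enough, but that would have to be checked against the data itself.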
Lars - 
Can you provide me with a date range or a list of ooids for the wrongly formatted data? I am planning to write an M/R job that can fix this format bug.
The error began as soon as we started pushing crashes to production.  So the date range would be **beginning-of-time** through sometime in the evening on 2010-04-08.  I can't see exactly what time the correction was pushed to production.
Daniel - What's the best/preferred way to get about 10000 rows containing the wrongly formatted data from prod to dev? I can write a Map/Reduce job or a Python script. Given that it's production, I'd prefer to run it by you before touching it :-)
I wouldn't put Python in the mix as it could cloud the problem or the resolution.
Let's check on #hbase for a clean effective method.
Daniel,
We need to move some data from prod to dev (or stage). Here's the export command that I tried a bunch of times and that seems to be working:
hadoop jar hbase.jar export 'crash_reports' hbase-output/ 1262304000 1262563200
hadoop jar hbase.jar export '<table_name>' <output_folder> <start_time> <end_time>

The above export command should give us the data from Jan 1, 2010 up to Jan 4, 2010 (three full days, if the two timestamps are read as epoch seconds in UTC).
I wanted to run the above command by you before we hit the prod HBase.
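As a quick sanity check on that range, the two values can be converted back to dates. The sketch assumes they are epoch seconds in UTC; HBase cell timestamps are normally in milliseconds, so it is worth confirming which unit the export tool expects before running it against prod.

```python
from datetime import datetime, timezone

# The two values passed to the export command above.
start, end = 1262304000, 1262563200

start_utc = datetime.fromtimestamp(start, tz=timezone.utc)
end_utc = datetime.fromtimestamp(end, tz=timezone.utc)

# Length of the window in days, assuming the values are seconds.
span_days = (end - start) / 86400
```

Read this way, the window starts at midnight on 2010-01-01 and ends at midnight on 2010-01-04, which is three full days of data.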
Lars,
Here's the sample meta_data from April 02, 2010:
{'submitted_timestamp': '2010-02-26T23:32:41.508290', 'StartupTime': '1267255939', 'Vendor': 'Mozilla', 'InstallTime': '1266593033', 'timestamp': 1267255961.509316, 'BuildID': '20091221164558', 'SecondsSinceLastCrash': '7', 'URL': '', 'ProductName': 'Firefox', 'Throttleable': '1', 'Version': '3.5.7', 'CrashTime': '1267255939', 'Email': 'info@missgermany.de'}

IIRC, the M/R job needs to replace the ' with " for above to look like:
{"submitted_timestamp": "2010-02-26T23:32:41.508290", "StartupTime": "1267255939", "Vendor": "Mozilla", "InstallTime": "1266593033", "timestamp": 1267255961.509316, "BuildID": "20091221164558", "SecondsSinceLastCrash": "7", "URL": "", "ProductName": "Firefox", "Throttleable": "1", "Version": "3.5.7", "CrashTime": "1267255939", "Email": "info@missgermany.de"}

Can you confirm?
I missed escaping the quotes....
original snafu'ed format:
{'submitted_timestamp': '2010-02-26T23:32:41.508290', 'StartupTime': '1267255939', 'Vendor': 'Mozilla', 'InstallTime': '1266593033', 'timestamp': 1267255961.509316, 'BuildID': '20091221164558', 'SecondsSinceLastCrash': '7', 'URL': '', 'ProductName': 'Firefox', 'Throttleable': '1', 'Version': '3.5.7','CrashTime': '1267255939', 'Email': 'info@missgermany.de'}

New fixed format:
"{\"submitted_timestamp\": \"2010-02-26T23:32:41.508290\", \"StartupTime\": \"1267255939\", \"Vendor\": \"Mozilla\", \"InstallTime\": \"1266593033\", \"timestamp\": 1267255961.509316, \"BuildID\": \"20091221164558\", \"SecondsSinceLastCrash\": \"7\", \"URL\": \"\", \"ProductName\": \"Firefox\", \"Throttleable\": \"1\", \"Version\": \"3.5.7\",\"CrashTime\": \"1267255939\", \"Email\": \"info@missgermany.de\"}"

Is this correct?
The second literal from comment #9 is correct and works great.  In comment #10, escaping the quotes breaks it.  The stringified JSON should not include an opening double quotation mark.
Daniel,
Currently, the meta_data:json value in byte format in HBase looks like this:
value=\x22\x7B\x5C\x22submitted_timestamp\x5C\x22

The equivalent string conversion:
value="{\"submitted_timestamp\"

I understand why we need to ignore the first double-quotes immediately after value=
However, I don't understand why each double quote is being escaped. Is it safe to say that the HBase-Thrift client escapes double quotes by default on a PUT, and that a GET via the HBase-Thrift client (for Python) marshals the value back out without the escaping?

So effectively, for the fix, should I be replacing 
' with \" 
or
' with "
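For what it's worth, the byte pattern quoted above (\x22\x7B\x5C\x22) is also what you get when a string that is already JSON is serialized to JSON a second time. The sketch below only demonstrates that pattern; whether the second encoding comes from the Thrift client or from application code is not established in this bug, and the unwrap helper is a hypothetical name.

```python
import json

meta = {"submitted_timestamp": "2010-02-26T23:32:41.508290"}

once = json.dumps(meta)    # a JSON object: {"submitted_timestamp": ...}
twice = json.dumps(once)   # a JSON string wrapping that object, quotes escaped

# First four bytes of the double-encoded form: 0x22 0x7B 0x5C 0x22
prefix = twice.encode("ascii")[:4]


def unwrap(value):
    """Decode repeatedly until the result is no longer a string."""
    decoded = json.loads(value)
    while isinstance(decoded, str):
        decoded = json.loads(decoded)
    return decoded
```

If that is what is happening, the stored value should be the single-encoded form (plain double quotes, no backslashes and no outer quotation mark), which matches the second literal in comment #9.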
This will be done in prod *after* the 1.7 push
Sample code has been tested @ staging HBase and verified by Lars.
Whiteboard: ETA - 7/16 need some spare cycles on my (aphadke's) end to run the job/s to completion
Whiteboard: ETA - 7/16 need some spare cycles on my (aphadke's) end to run the job/s to completion → ETA - 7/16, confirm hbase stability
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro