Closed
Bug 557827
Opened 14 years ago
Closed 14 years ago
Socorro data in hbase is in wrong format
Categories
(Socorro :: General, task, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
1.7
People
(Reporter: lars, Assigned: aphadke)
References
Details
(Whiteboard: ETA - 7/16, confirm hbase stability)
A series of snafus resulted in all JSON data pushed into HBase before the production push of 1.6 being stored in Python repr format instead of JSON. This is a perfect opportunity for a map job. The data just needs to be read back into Python (or Jython), reserialized as JSON, and put back.
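The read-back-and-reserialize step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `repr_to_json` and the sample record are hypothetical, and the real job would run per row inside a map task rather than on a literal string.

```python
import ast
import json


def repr_to_json(raw: str) -> str:
    """Parse a Python-repr dict literal and reserialize it as JSON."""
    # literal_eval only accepts literals (dicts, strings, numbers, ...),
    # so it does not execute arbitrary code the way eval() would.
    data = ast.literal_eval(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a dict literal")
    return json.dumps(data)


# Hypothetical sample in the repr format described in this bug.
sample = "{'ProductName': 'Firefox', 'Version': '3.5.7', 'timestamp': 1267255961.509316}"
fixed = repr_to_json(sample)
print(fixed)
```

Parsing the value and serializing it again avoids any text-level quote manipulation, so strings that contain quote characters are handled by the parser rather than by search and replace.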
Reporter
Updated•14 years ago
Assignee: nobody → deinspanjer
Target Milestone: --- → 1.7
Updated•14 years ago
Priority: -- → P1
Comment 1•14 years ago
Just to make sure: has the data insert been corrected, or is data still going in "bad"? If it has been fixed, the date the fix hit production would be handy. Passing this over to Anurag to develop the MR job.
Assignee: deinspanjer → aphadke
Assignee
Comment 2•14 years ago
To confirm:
table name = crash_reports
key = ooid (hbase ooid)
metadata: <byte array of the serialized json string>
Reporter
Comment 3•14 years ago
Yes, that's the right table and column. For data inserted before Socorro 1.6 shipped, the value in 'meta_data:json' isn't really JSON. It is instead in Python repr form, which is very similar to JSON. The two may be so close that the fix is simply changing single quotes to double quotes. However, that needs to be verified.
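One case worth checking during that verification is a value that itself contains an apostrophe. The sketch below uses a hypothetical record (the `Comments` field and its text are made up for illustration) to show that a plain quote swap can produce invalid JSON, while parsing the repr with `ast.literal_eval` still recovers the data.

```python
import ast
import json

# Hypothetical record: the value contains an apostrophe, so Python's repr
# wraps that string in double quotes instead of single quotes.
record = {"Comments": "it's broken", "ProductName": "Firefox"}
as_repr = repr(record)

# Naive fix: swap every single quote for a double quote.
naive = as_repr.replace("'", '"')
try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# Parsing the repr as a Python literal recovers the original record.
parsed = ast.literal_eval(as_repr)
roundtrip_ok = parsed == record
print(naive_ok, roundtrip_ok)
```

Whether the production data actually contains such values is not established in this bug; the example only shows why the quote swap needs verifying.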
Assignee
Comment 4•14 years ago
Lars - Can you provide me with a date range or a list of ooids for the wrongly formatted data? I am planning to write an M/R job that can fix this format bug.
Reporter
Comment 5•14 years ago
The error began as soon as we started pushing crashes to production. So the date range would be **beginning-of-time** through sometime in the evening on 2010-04-08. I can't see exactly what time the correction was pushed to production.
Assignee
Comment 6•14 years ago
Daniel - What's the best/preferred way to get about 10000 rows containing the wrongly formatted data from prod to dev? I can write a Map/Reduce job or a Python script. Given that it's production, I prefer running it by you before touching it :-)
Comment 7•14 years ago
I wouldn't put Python in the mix, as it could cloud the problem or the resolution. Let's check on #hbase for a clean, effective method.
Assignee
Comment 8•14 years ago
Daniel,
We need to move some data from prod to dev (or stage). Here's the export command, which I have tried a number of times and which seems to work:
hadoop jar hbase.jar export 'crash_reports' hbase-output/ 1262304000 1262563200
The general form is:
hadoop jar hbase.jar export '<table_name>' <output_folder> starttime endtime
The above export command should give us 4 days of data, from Jan 1, 2010 to Jan 4, 2010. I wanted to run the command by you before we hit the prod HBase.
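The two numeric arguments can be cross-checked with a short Python sketch. It assumes the values are Unix timestamps in seconds at midnight UTC; the helper name `epoch_seconds` is hypothetical. Whether the export tool expects seconds or milliseconds is not confirmed in this bug and is worth checking against the HBase version in use.

```python
from datetime import datetime, timezone


def epoch_seconds(year: int, month: int, day: int) -> int:
    """Return the Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


start = epoch_seconds(2010, 1, 1)
end = epoch_seconds(2010, 1, 4)
# Number of whole days between the two bounds.
print(start, end, (end - start) // 86400)
```

Under these assumptions the bounds match the command's arguments and sit three whole days apart, so if the end time is treated as exclusive the window covers Jan 1 through Jan 3.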
Assignee
Comment 9•14 years ago
Lars,
Here's the sample meta_data from April 02, 2010:
{'submitted_timestamp': '2010-02-26T23:32:41.508290', 'StartupTime': '1267255939', 'Vendor': 'Mozilla', 'InstallTime': '1266593033', 'timestamp': 1267255961.509316, 'BuildID': '20091221164558', 'SecondsSinceLastCrash': '7', 'URL': '', 'ProductName': 'Firefox', 'Throttleable': '1', 'Version': '3.5.7', 'CrashTime': '1267255939', 'Email': 'info@missgermany.de'}
IIRC, the M/R job needs to replace each ' with " so that the above looks like:
{"submitted_timestamp": "2010-02-26T23:32:41.508290", "StartupTime": "1267255939", "Vendor": "Mozilla", "InstallTime": "1266593033", "timestamp": 1267255961.509316, "BuildID": "20091221164558", "SecondsSinceLastCrash": "7", "URL": "", "ProductName": "Firefox", "Throttleable": "1", "Version": "3.5.7", "CrashTime": "1267255939", "Email": "info@missgermany.de"}
Can you confirm?
Assignee
Comment 10•14 years ago
I missed escaping the quotes....
Original snafu'ed format:
{'submitted_timestamp': '2010-02-26T23:32:41.508290', 'StartupTime': '1267255939', 'Vendor': 'Mozilla', 'InstallTime': '1266593033', 'timestamp': 1267255961.509316, 'BuildID': '20091221164558', 'SecondsSinceLastCrash': '7', 'URL': '', 'ProductName': 'Firefox', 'Throttleable': '1', 'Version': '3.5.7','CrashTime': '1267255939', 'Email': 'info@missgermany.de'}
New fixed format:
"{\"submitted_timestamp\": \"2010-02-26T23:32:41.508290\", \"StartupTime\": \"1267255939\", \"Vendor\": \"Mozilla\", \"InstallTime\": \"1266593033\", \"timestamp\": 1267255961.509316, \"BuildID\": \"20091221164558\", \"SecondsSinceLastCrash\": \"7\", \"URL\": \"\", \"ProductName\": \"Firefox\", \"Throttleable\": \"1\", \"Version\": \"3.5.7\",\"CrashTime\": \"1267255939\", \"Email\": \"info@missgermany.de\"}"
Is this correct?
Reporter
Comment 11•14 years ago
The second literal from comment #9 is correct and works great. In comment #10, escaping the quotes breaks it. The stringified JSON should not include an opening double quotation mark.
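The difference between the two literals can be reproduced with Python's `json` module. The sketch below uses a shortened, hypothetical record: serializing an already-serialized string yields the quoted, escaped shape from comment #10, and a single decode of that value gives back a string rather than a dict.

```python
import json

# Shortened, hypothetical record in the correct format from comment #9.
good = '{"ProductName": "Firefox", "Version": "3.5.7"}'

# Serializing an already-serialized string wraps it in quotes and escapes
# the inner ones, which is the shape of the literal in comment #10.
double_encoded = json.dumps(good)

first = json.loads(good)
second = json.loads(double_encoded)
print(type(first).__name__, type(second).__name__)
```

A consumer that decodes once and expects a dict would therefore break on the escaped form, which is consistent with the observation above.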
Assignee
Comment 12•14 years ago
Daniel,
Currently, the meta_data:json value in byte format in HBase looks like this:
value=\x22\x7B\x5C\x22submitted_timestamp\x5C\x22
The equivalent string conversion:
value="{\"submitted_timestamp\"
I understand why we need to drop the first double quote immediately after value=. However, I don't understand why each double quote is being escaped. Is it safe to say that the HBase-Thrift client escapes double quotes by default on a PUT, and that a GET via the HBase-Thrift client (for Python) marshals the value back out without the double quotes? So effectively, for the fix, should I be replacing ' with \" or ' with "?
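The byte prefix quoted above can be decoded in Python to confirm the string form. The sketch also shows one way a value of that shape could arise (JSON encoded twice) and a hypothetical helper, `unwrap`, that decodes until a non-string value appears. It does not answer whether the Thrift client does any escaping, which this bug leaves open.

```python
import json

# The byte prefix quoted above, written as a Python bytes literal.
prefix = b"\x22\x7B\x5C\x22submitted_timestamp\x5C\x22"
print(prefix.decode("ascii"))


def unwrap(value: str):
    """Decode JSON repeatedly until the result is no longer a string."""
    decoded = json.loads(value)
    while isinstance(decoded, str):
        decoded = json.loads(decoded)
    return decoded


# Assumption for illustration: the stored value is JSON that was encoded twice.
stored = json.dumps(json.dumps({"submitted_timestamp": "2010-02-26T23:32:41.508290"}))
record = unwrap(stored)
print(stored.encode("ascii").startswith(prefix), record)
```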
Comment 13•14 years ago
This will be done in prod *after* the 1.7 push
Assignee
Comment 14•14 years ago
Sample code has been tested on staging HBase and verified by Lars.
Assignee
Comment 15•14 years ago
Documentation to run the hadoop job: https://intranet.mozilla.org/Metrics/Crash_Report_Analysis_Project/Cluster_Notes/CleanupMapReduceJobsForSocorroProduction
Assignee
Updated•14 years ago
Whiteboard: ETA - 7/16 need some spare cycles on my (aphadke's) end to run the job/s to completion
Assignee
Updated•14 years ago
Whiteboard: ETA - 7/16 need some spare cycles on my (aphadke's) end to run the job/s to completion → ETA - 7/16, confirm hbase stability
Assignee
Updated•14 years ago
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•13 years ago
Component: Socorro → General
Product: Webtools → Socorro