Closed Bug 599352 Opened 12 years ago Closed 12 years ago

HBase region of ooids starting with 8 is broken


(Socorro :: General, task, P1)


(Not tracked)



(Reporter: laura, Assigned: dre)



(1 file)

No description provided.
From Daniel's email:
"we discovered that all of the latest failures are for records that have keys beginning with "8".  You probably remember that our rowkeys are formatted as <first hex char of guid><date><guid> so that pointed at a bad region.

I ran the version of check_meta.rb that we have installed, and it did in fact discover a hole in the meta for that range. Unfortunately, it throws an exception when attempting to --fix the problem (listed below).  I checked several of the region ids listed below, and most of them are those old regions that we archived back in June.  The data is still sitting in the /hbase/crash_reports table and taking up space and (it appears) getting in the way of check_meta.rb.

At the point we are at now, it would be acceptable to just delete these old regions *if* we had a good way to figure out which is which and do so."
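For reference, the rowkey layout described in the email can be sketched as below. This is only an illustration, not the actual Socorro code: the helper name `make_rowkey` is hypothetical, and it assumes the date is the six trailing digits (YYMMDD) of the ooid, which is consistent with the sample values later in this bug.

```python
import re

# Assumed ooid shape: a GUID whose last six characters are a YYMMDD date.
OOID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{6}\d{6}"
)


def make_rowkey(ooid: str) -> str:
    """Build <first hex char of guid><date><guid> (hypothetical helper)."""
    if not OOID_RE.fullmatch(ooid):
        raise ValueError("unexpected ooid format: %r" % ooid)
    salt = ooid[0]      # first hex character of the guid
    date = ooid[-6:]    # assumed YYMMDD suffix of the guid
    return salt + date + ooid


if __name__ == "__main__":
    print(make_rowkey("ac7538ac-c9cf-4fb0-babb-eeee22100227"))
```

With this layout, every ooid starting with "8" produces a rowkey starting with "8", which is why a single bad region affects all of them.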
Assignee: nobody → deinspanjer
Priority: -- → P1
Daniel - I looked at the meta_before_excise.txt file. What if we ran a MapReduce job that deleted all keys with the salt '<hex>0100610'?
Some of the keys might belong to the new regions, but given we have the data archived on NFS, it'll at least buy us some room.
The old keys are ones that start with 10, not ones with a salt char.
We don't have a backup of the data; we don't have an NFS with 10 TB of data anywhere.
Each region has a name that starts with tablename,startkey, but that is not how the regions are stored in HDFS. There they are stored under the "encoded name", which is an integer. In that file, the integer is the one after the readable name.
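A rough way to tell the two key formats apart, as a sketch only: the patterns below assume that old keys are <date><guid> and new keys are <hex char><date><guid>, with a six-digit date. The names `classify_key`, `LEGACY_KEY` and `SALTED_KEY` are made up for illustration, and real region start keys may not be full rowkeys.

```python
import re

GUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
LEGACY_KEY = re.compile(r"\d{6}" + GUID)          # old format: <date><guid>
SALTED_KEY = re.compile(r"[0-9a-f]\d{6}" + GUID)  # new format: <salt><date><guid>


def classify_key(key: str) -> str:
    """Return 'legacy', 'salted' or 'unknown' for a rowkey (hypothetical helper)."""
    if LEGACY_KEY.fullmatch(key):
        return "legacy"
    if SALTED_KEY.fullmatch(key):
        return "salted"
    return "unknown"


if __name__ == "__main__":
    print(classify_key("100227ac7538ac-c9cf-4fb0-babb-eeee22100227"))
    print(classify_key("a100227ac7538ac-c9cf-4fb0-babb-eeee22100227"))
```

The two patterns cannot match the same string, because the salt character shifts the first hyphen of the guid by one position.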
What if we parsed the file at /home/deinspenjer/meta_before_excise.txt, looking for the start and end rows with '1002*' (i.e. keys in the old format), and then grabbed the corresponding integer? Would that help in cleanup?
Yes, that is the track I was thinking of. For regions starting with crash_reports,100*, get the encoded region name and delete those files in HDFS.
Alright, I am working on it and will update the ticket once it's done. What's the impact if we accidentally delete a wrong region?
(In reply to comment #6)
> Alright, I am working on it and will update the ticket once it's done. What's
> the impact if we accidentally delete a wrong region?

Correct me if I'm wrong, but unrecoverable data loss?
Permanently lost data. Don't write something that does the delete; write something that just outputs the HDFS paths we wish to delete. Then we can spot check and feed that to hadoop fs rm.
Yup. Will only be printing the paths, no HDFS operations.
Sample values here; see the attachment in comment #10 for the full set.

100227ac7538ac-c9cf-4fb0-babb-eeee22100227      1267328647632
1002279bdf3f1c-ee87-42b2-96d3-7845d2100227      1267328345569
100227c739deeb-daca-4ec5-a939-a67262100227      1267327858753
100227b74ccca5-d0a6-4252-9369-c240a2100227      1267331732018
100226b22f9efa-796c-49b1-8826-116662100226      1267253586726
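A print-only pass over lines in the shape above could look something like the following sketch. It rests on assumptions: the input is two whitespace-separated columns (start key, then the integer), that integer is the encoded region name, and region directories live directly under /hbase/crash_reports. The function name `candidate_paths` is hypothetical. It performs no HDFS operations and only emits paths for spot checking.

```python
def candidate_paths(lines, table_dir="/hbase/crash_reports", prefix="100"):
    """Return HDFS paths for regions whose start key begins with `prefix`.

    Assumes each line is '<start key> <encoded region name>'; anything
    else is skipped rather than guessed at.
    """
    paths = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # not a two-column line, ignore it
        start_key, encoded = fields
        if start_key.startswith(prefix) and encoded.isdigit():
            paths.append("%s/%s" % (table_dir, encoded))
    return paths


SAMPLE = """\
100227ac7538ac-c9cf-4fb0-babb-eeee22100227      1267328647632
100226b22f9efa-796c-49b1-8826-116662100226      1267253586726
"""

if __name__ == "__main__":
    # Print only; deletion stays a separate, manually reviewed step.
    for path in candidate_paths(SAMPLE.splitlines()):
        print(path)
```

Keeping the output as a plain list of paths means it can be reviewed by hand before anything is removed.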
Patrick Angels (from Cloudera) suggested we try renaming these regions instead of deleting them outright. Thoughts?
The region was brought back online without having to do anything with these extra regions.  That said, we should take a look at what to do with them in another bug.
Daniel, can I close this?
Yes. it is ready to be closed.
Closed: 12 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro