Closed Bug 1334851 Opened 9 years ago Closed 9 years ago

HBaseMainSummaryView ETL is failing

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: rvitillo)

Details

User Story

The ETL job started timing out due to the following error:

Sat Jan 28 17:57:22 UTC 2017, RpcRetryingCaller{globalStartTime=1485625751841, pause=100, retries=35}, org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region main_summary,19ba6ed:,1485508115761.7850bfa3bc526797213ddeb676dfe128. is not online on ip-172-31-9-172.us-west-2.compute.internal,16020,1485508154760
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2929)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1057)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.bulkLoadHFile(RSRpcServices.java:1936)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33650)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2178)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
        at java.lang.Thread.run(Thread.java:745)



Digging into the HBase logs, I found that one region server had recently been restarted after it failed to split a region:

2017-01-27 09:08:46,567 INFO  [regionserver/ip-172-31-9-172.us-west-2.compute.internal/172.31.9.172:16020-splits-1484979125601] regionserver.SplitRequest: Running rollback/cleanup of failed split of main_summary,1999997c,1482348185998.53db17af9b14d0c46e6b05f7950ac735.; Failed ip-172-31-9-172.us-west-2.compute.internal,16020,1482347841787-daughterOpener=20677a82cd5bc4e10968b303d98bf460
java.io.IOException: Failed ip-172-31-9-172.us-west-2.compute.internal,16020,1482347841787-daughterOpener=20677a82cd5bc4e10968b303d98bf460
	at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.openDaughters(SplitTransactionImpl.java:499)
	at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.stepsAfterPONR(SplitTransactionImpl.java:597)
	at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.execute(SplitTransactionImpl.java:580)
	at org.apache.hadoop.hbase.regionserver.SplitRequest.doSplitting(SplitRequest.java:82)
	at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:154)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.io.IOException: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:9eb7991ee727498589fdc0154a3c5c81.53db17af9b14d0c46e6b05f7950ac735_4349324881
	at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:943)
	at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:818)
	at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:793)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6519)
	at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl.openDaughterRegion(SplitTransactionImpl.java:731)
	at org.apache.hadoop.hbase.regionserver.SplitTransactionImpl$DaughterOpener.run(SplitTransactionImpl.java:711)
	... 1 more
Caused by: java.io.IOException: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:9eb7991ee727498589fdc0154a3c5c81.53db17af9b14d0c46e6b05f7950ac735_4349324881
	at org.apache.hadoop.hbase.regionserver.HStore.openStoreFiles(HStore.java:545)
	at org.apache.hadoop.hbase.regionserver.HStore.loadStoreFiles(HStore.java:500)
	at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:262)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:5032)
	at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:917)
	at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:914)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
Caused by: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:9eb7991ee727498589fdc0154a3c5c81.53db17af9b14d0c46e6b05f7950ac735_4349324881
	at org.apache.hadoop.hbase.io.hfile.LruBlockCache.cacheBlock(LruBlockCache.java:368)
	at org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.cacheBlock(CombinedBlockCache.java:66)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:460)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:271)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:194)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekBefore(HFileReaderV2.java:666)
	at org.apache.hadoop.hbase.io.HalfStoreFileReader$1.seekBefore(HalfStoreFileReader.java:307)
	at org.apache.hadoop.hbase.io.HalfStoreFileReader$1.seekBefore(HalfStoreFileReader.java:184)
	at org.apache.hadoop.hbase.io.HalfStoreFileReader$1.seekBefore(HalfStoreFileReader.java:178)
	at org.apache.hadoop.hbase.io.HalfStoreFileReader.getLastKey(HalfStoreFileReader.java:341)
	at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:488)
	at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:504)
	at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:494)
	at org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:653)
	at org.apache.hadoop.hbase.regionserver.HStore.access$000(HStore.java:118)
	at org.apache.hadoop.hbase.regionserver.HStore$1.call(HStore.java:520)
	at org.apache.hadoop.hbase.regionserver.HStore$1.call(HStore.java:517)
	... 6 more
2017-01-27 09:08:46,578 ERROR [regionserver/ip-172-31-9-172.us-west-2.compute.internal/172.31.9.172:16020-splits-1484979125601] regionserver.HRegionServer: ABORTING region server ip-172-31-9-172.us-west-2.compute.internal,16020,1482347841787: Abort; we got an error after point-of-no-return
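
The nested "Caused by:" chain above buries the actual failure three levels deep. As a rough sketch (not part of the original investigation), the innermost cause can be pulled out of a region server log excerpt like this; the helper name root_cause and the SAMPLE_TRACE variable are hypothetical, and the sample lines are copied from the trace above:

```python
import re

# Matches the "Caused by: ..." lines that Java prints for each wrapped exception.
CAUSED_BY = re.compile(r"^Caused by: (?P<chain>.+)$")

# A few lines copied from the log above (stack frames mostly omitted).
SAMPLE_TRACE = """\
Caused by: java.io.IOException: java.io.IOException: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:9eb7991ee727498589fdc0154a3c5c81.53db17af9b14d0c46e6b05f7950ac735_4349324881
	at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:943)
Caused by: java.io.IOException: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:9eb7991ee727498589fdc0154a3c5c81.53db17af9b14d0c46e6b05f7950ac735_4349324881
Caused by: java.lang.RuntimeException: Cached block contents differ, which should not have happened.cacheKey:9eb7991ee727498589fdc0154a3c5c81.53db17af9b14d0c46e6b05f7950ac735_4349324881
	at org.apache.hadoop.hbase.io.hfile.LruBlockCache.cacheBlock(LruBlockCache.java:368)
"""


def root_cause(log_text):
    """Return (exception_class, message) for the last 'Caused by:' line, or None.

    Assumption: the last 'Caused by:' line in the excerpt is the innermost
    cause, which holds for a single stack trace like the one above.
    """
    last = None
    for line in log_text.splitlines():
        match = CAUSED_BY.match(line.strip())
        if match:
            last = match.group("chain")
    if last is None:
        return None
    # Split "java.lang.RuntimeException: message" at the first ": ".
    exception_class, _, message = last.partition(": ")
    return exception_class, message


if __name__ == "__main__":
    print(root_cause(SAMPLE_TRACE))
```

Here the innermost exception is the LruBlockCache "Cached block contents differ" RuntimeException, which is what made opening the daughter region fail after the split's point of no return.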



The "hbase hbck" tool confirmed that the main_summary table was in an inconsistent state:

ERROR: Region { meta => main_summary,1999997c,1485508115761.20677a82cd5bc4e10968b303d98bf460., hdfs => null, deployed => , replicaId => 0 } found in META, but not in HDFS or deployed on any region server.
ERROR: Region { meta => null, hdfs => null, deployed => ip-172-31-1-17.us-west-2.compute.internal,16020,1485354957477;main_summary,1999997c,1482348185998.53db17af9b14d0c46e6b05f7950ac735., replicaId => 0 }, key=53db17af9b14d0c46e6b05f7950ac735, not on HDFS or in hbase:meta but deployed on ip-172-31-1-17.us-west-2.compute.internal,16020,1485354957477
ERROR: Region { meta => main_summary,19ba6ed:,1485508115761.7850bfa3bc526797213ddeb676dfe128., hdfs => null, deployed => , replicaId => 0 } found in META, but not in HDFS or deployed on any region server.
ERROR: No regioninfo in Meta or HDFS. { meta => null, hdfs => null, deployed => ip-172-31-1-17.us-west-2.compute.internal,16020,1485354957477;main_summary,1999997c,1482348185998.53db17af9b14d0c46e6b05f7950ac735., replicaId => 0 }
ERROR: There is a hole in the region chain between 1999997c and 19db22b3.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table main_summary
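
For a quicker read of output like the above, the ERROR lines can be grouped by encoded region name and the reported holes listed. This is only a sketch under a few assumptions: the helper name summarize_hbck is hypothetical, the regular expressions are fitted to the lines shown in this bug rather than to any documented hbck output format, and the sample is a subset of the lines above:

```python
import re
from collections import defaultdict

# Encoded region names in the output above are 32 lowercase hex characters.
ENCODED_REGION = re.compile(r"\b[0-9a-f]{32}\b")
# Fitted to the "hole in the region chain" message shown above.
HOLE = re.compile(r"hole in the region chain between (\S+) and (\S+?)\.\s")

# Four of the ERROR lines from the hbck output above, copied verbatim.
SAMPLE_HBCK = """\
ERROR: Region { meta => main_summary,1999997c,1485508115761.20677a82cd5bc4e10968b303d98bf460., hdfs => null, deployed => , replicaId => 0 } found in META, but not in HDFS or deployed on any region server.
ERROR: Region { meta => null, hdfs => null, deployed => ip-172-31-1-17.us-west-2.compute.internal,16020,1485354957477;main_summary,1999997c,1482348185998.53db17af9b14d0c46e6b05f7950ac735., replicaId => 0 }, key=53db17af9b14d0c46e6b05f7950ac735, not on HDFS or in hbase:meta but deployed on ip-172-31-1-17.us-west-2.compute.internal,16020,1485354957477
ERROR: There is a hole in the region chain between 1999997c and 19db22b3.  You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table main_summary
"""


def summarize_hbck(output):
    """Count ERROR lines, group them by encoded region name, and list holes."""
    errors = [l.strip() for l in output.splitlines() if l.strip().startswith("ERROR:")]
    errors_by_region = defaultdict(int)
    holes = []
    for line in errors:
        # Count each region at most once per line, even if it appears twice.
        for encoded in set(ENCODED_REGION.findall(line)):
            errors_by_region[encoded] += 1
        match = HOLE.search(line)
        if match:
            holes.append((match.group(1), match.group(2)))
    return {
        "error_count": len(errors),
        "errors_by_region": dict(errors_by_region),
        "holes": holes,
    }


if __name__ == "__main__":
    summary = summarize_hbck(SAMPLE_HBCK)
    print(summary["error_count"], summary["holes"])
    for region, count in summary["errors_by_region"].items():
        print(region, count)
```

Run against the sample, this points at the two regions involved in the failed split (the parent 53db17af9b14d0c46e6b05f7950ac735 and the daughter 20677a82cd5bc4e10968b303d98bf460) and at the hole between 1999997c and 19db22b3.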



I then repaired the table with "sudo hbase hbck -repair", which completed successfully.

Finally, I restarted the HBase master with:
sudo stop hbase-master
sudo start hbase-master
User Story: (updated)
Status: NEW → RESOLVED
Points: --- → 1
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard