21:52:27 < nagios-phx1>  sp-admin01.phx1:Thrift - socorro-thrift1.zlb.phx1.mozilla.com is CRITICAL: CRITICAL: Thrift connection to socorro-thrift1.zlb.phx1.mozilla.com is not viable

HBase may need to be looked into, per Mana. However, there were no connection alerts from the processors preceding this.
22:43:22 < nagios-phx1>  tp-socorro01-master01.phx1:PostgreSQL Last Reports Update is WARNING: CHECKGANGLIA WARNING: last_record_reports is 3947.00

^^ is quite likely because of this. Called tmary a few times, but no answer :(
Unable to wake up deinspanjer or xstevens either ;(
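For context, last_record_reports appears to measure the number of seconds since the newest record landed in the reports table on the PostgreSQL master, so it climbs whenever the processors stop writing. A rough sketch of the kind of query behind the check; the host, database, table, and column names here are assumptions for illustration, not the actual CHECKGANGLIA probe:

import psycopg2  # assumed driver; any PostgreSQL client would do

# Sketch of what last_record_reports likely computes: seconds since the
# newest row in the reports table. All names below are illustrative.
conn = psycopg2.connect(host='tp-socorro01-master01.phx1', dbname='breakpad')
cur = conn.cursor()
cur.execute(
    "SELECT EXTRACT(EPOCH FROM (now() - MAX(date_processed))) FROM reports"
)
lag_seconds = cur.fetchone()[0]
print(lag_seconds)  # 3947.00 at the time of the WARNING above
cur.close()
conn.close()

If that reading is in seconds, the newest report would be over an hour old, which roughly lines up with the Thrift alert at 21:52.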
rhelmer thinks this needs action ASAP rather than later. Have asked ashish to page cshields.
(In reply to Shyam Mani [:fox2mike] from comment #3)
> rhelmer thinks this needs action ASAP rather than later. Have asked ashish
> to page cshields.

Daniel is online, so didn't page Corey.
Severity: major → critical
The cluster shut down because the NameNode service suddenly stopped. I have looked a bit to try to figure out why, but haven't seen a cause yet. I restarted the NameNode, then restarted the HBase RegionServers until they were all steady and reloading regions. The cluster seems to be coming back up fine. Need tmary to investigate the cause of the failure.
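For anyone watching the recovery, a minimal sketch of one way to poll the Thrift gateway until the region count for crash_reports holds steady. This is illustrative only (the host, port, and the "three unchanged polls" heuristic are assumptions), not the exact procedure used here:

import time

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase  # generated Thrift bindings, as in socorro/thirdparty

# Connect to the HBase Thrift 1 gateway; host/port are assumptions.
transport = TTransport.TBufferedTransport(
    TSocket.TSocket('socorro-thrift1.zlb.phx1.mozilla.com', 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)
transport.open()

previous, steady_polls = -1, 0
while steady_polls < 3:  # call it steady after three unchanged polls
    count = len(client.getTableRegions('crash_reports'))
    steady_polls = steady_polls + 1 if count == previous else 0
    previous = count
    print('%d regions online' % count)
    time.sleep(30)
transport.close()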
Assignee: server-ops → tmeyarivan
Severity: critical → major
Last Reports Update on tp-socorro01-master01.phx1 dropped for a short while before trending up again after deinspanjer's fix. It's now at ~11300; looping in the Socorro devs before this goes critical.
Left a voicemail on Laura's phone. Bright side: last_record_reports is at 10603.00.
Still not right. Looking at e.g. the crashmover logs, I can see that Thrift connections sometimes succeed, but mostly only on a retry. There are a lot of these errors (example below); I would say more connections are failing than succeeding, but some are succeeding.

2011-08-19 05:02:15,661 DEBUG - Thread-4 - Thread-4 - retry_wrapper: handled exception, timed out
2011-08-19 05:02:15,661 ERROR - Thread-4 - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'>
2011-08-19 05:02:15,662 ERROR - Thread-4 - the connection is not viable. retries fail:
2011-08-19 05:02:15,662 ERROR - Thread-4 - trace back follows:
2011-08-19 05:02:15,663 ERROR - Thread-4 - Traceback (most recent call last):
2011-08-19 05:02:15,664 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/crashstorage.py", line 301, in save_raw self.hbaseConnection.put_json_dump(uuid, jsonData, dump, number_of_retries=2)
2011-08-19 05:02:15,664 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/hbaseClient.py", line 144, in f result = fn(self, *args, **kwargs)
2011-08-19 05:02:15,665 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/hbaseClient.py", line 685, in put_json_dump self.client.mutateRow('crash_reports', row_id, mutationList) # unit test marker 233
2011-08-19 05:02:15,665 ERROR - Thread-4 - File "/data/socorro/thirdparty/hbase/hbase.py", line 1251, in mutateRow self.recv_mutateRow()
2011-08-19 05:02:15,666 ERROR - Thread-4 - File "/data/socorro/thirdparty/hbase/hbase.py", line 1264, in recv_mutateRow (fname, mtype, rseqid) = self._iprot.readMessageBegin()
2011-08-19 05:02:15,666 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin sz = self.readI32()
2011-08-19 05:02:15,667 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/protocol/TBinaryProtocol.py", line 203, in readI32 buff = self.trans.readAll(4)
2011-08-19 05:02:15,667 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TTransport.py", line 58, in readAll chunk = self.read(sz-have)
2011-08-19 05:02:15,668 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TTransport.py", line 155, in read self.__rbuf = StringIO(self.__trans.read(max(sz, self.DEFAULT_BUFFER)))
2011-08-19 05:02:15,668 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TSocket.py", line 92, in read buff = self.handle.recv(sz)
2011-08-19 05:02:15,669 ERROR - Thread-4 - FatalException: the connection is not viable. retries fail:
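For anyone reading along: the retry_wrapper in that traceback (hbaseClient.py, line 144) is a decorator around the HBase client calls. Roughly, it behaves like the sketch below; the names, the caught exception type, and the reconnect helper are illustrative, not the actual Socorro code. The point is that number_of_retries=2 gives each put a couple of reconnect-and-retry attempts before raising the FatalException logged above:

import functools
import logging
import socket

logger = logging.getLogger(__name__)

class FatalException(Exception):
    pass

def retry_wrapper(fn):
    # Sketch of the retry pattern visible in the traceback; the real code
    # likely also catches Thrift transport errors, not just socket.timeout.
    @functools.wraps(fn)
    def f(self, *args, **kwargs):
        number_of_retries = kwargs.pop('number_of_retries', 1)
        for attempt in range(number_of_retries + 1):
            try:
                return fn(self, *args, **kwargs)
            except socket.timeout:
                logger.debug('retry_wrapper: handled exception, timed out')
                self.make_connection()  # hypothetical reconnect helper
        raise FatalException('the connection is not viable. retries fail:')
    return f

That would be consistent with what we're seeing: with most connections in the pool bad, each save_raw burns through its retries and surfaces as the "the connection is not viable. retries fail:" errors above.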
tmary, any updates?
Some regions in HBase are still down. HBase devs / Cloudera are trying to come up with a temporary fix to get the regions back online. Until a fix works, it is difficult to estimate an ETA.

--
Apart from the affected data (regions corresponding to data from 18/19-Aug-2011), the service is up and available. Re the affected regions, Cloudera support and Michael Stack are working towards fixing them; unfortunately, no ETA yet.

--
Status: NEW → RESOLVED
Resolution: --- → FIXED
I have a hacking session planned with Stack and Cloudera today at 3 Pacific.
Product: mozilla.org → mozilla.org Graveyard