Bug 680348 (Closed) - Socorro Thrift Prod connection not viable
Opened 14 years ago · Closed 14 years ago
Categories: mozilla.org Graveyard :: Server Operations (task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: ashish; Assigned: tmary
21:52:27 < nagios-phx1> [121] sp-admin01.phx1:Thrift - socorro-thrift1.zlb.phx1.mozilla.com is CRITICAL: CRITICAL: Thrift connection to socorro-thrift1.zlb.phx1.mozilla.com is not viable
HBase may need to be looked into, per Mana. However, there were no connection alerts from the processors preceding this.
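For context, a Thrift connection is deemed "viable" when a transport to the ZLB VIP can actually be opened. A minimal sketch of such a check, in the spirit of the nagios probe (the host below is taken from the alert; port 9090 is only the conventional HBase Thrift port, and the function as a whole is an illustration, not the actual plugin):

from thrift.transport import TSocket, TTransport

def thrift_viable(host='socorro-thrift1.zlb.phx1.mozilla.com', port=9090):
    # Open (and immediately close) a buffered Thrift transport; failing to
    # open is exactly the "connection is not viable" condition alerted on.
    sock = TSocket.TSocket(host, port)
    sock.setTimeout(5000)  # milliseconds
    transport = TTransport.TBufferedTransport(sock)
    try:
        transport.open()
        return True
    except TTransport.TTransportException:
        return False
    finally:
        transport.close()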
Reporter
Comment 1 • 14 years ago
22:43:22 < nagios-phx1> [192] tp-socorro01-master01.phx1:PostgreSQL Last Reports Update is WARNING: CHECKGANGLIA WARNING: last_record_reports is 3947.00
^^ is quite likely because of this
Called tmary a few times but no answer :(
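For reference, last_record_reports is a lag metric: how many seconds old the newest row in the reports table is. A hedged sketch of the underlying query, assuming Socorro's PostgreSQL schema with a reports.date_processed column (the function and DSN handling are illustrative only):

import psycopg2

def last_record_reports(dsn):
    # Seconds since the most recent processed report landed in PostgreSQL;
    # a value like 3947.00 means the processors are roughly an hour behind.
    conn = psycopg2.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute("SELECT EXTRACT(EPOCH FROM (now() - max(date_processed))) FROM reports")
        return cur.fetchone()[0]
    finally:
        conn.close()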
Reporter
Comment 2 • 14 years ago
Unable to wake up deinspanjer or xstevens either ;(
Comment 3 • 14 years ago
rhelmer thinks this needs action ASAP rather than later. Have asked ashish to page cshields.
Comment 4 • 14 years ago
(In reply to Shyam Mani [:fox2mike] from comment #3)
> rhelmer thinks this needs action ASAP rather than later. Have asked ashish
> to page cshields.
Daniel is online, so didn't page Corey.
Severity: major → critical
Comment 5 • 14 years ago
The cluster shut down because the NameNode service suddenly stopped. I've looked briefly for a cause but haven't found one yet.
I restarted the NN, then restarted HBase RegionServers until they were all steady-on and reloading regions.
The cluster seems to be coming back up fine. Need tmary to do some investigation as to the cause of the failure.
Assignee: server-ops → tmeyarivan
Severity: critical → major
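For anyone verifying the recovery, one quick way to confirm a restarted NameNode is answering again is the stock dfsadmin report; a minimal sketch (assumes the hadoop CLI is on PATH on a configured HDFS client, Python 2 of the era):

import subprocess

def namenode_responding():
    # `hadoop dfsadmin -report` exits non-zero when the NameNode is unreachable.
    with open('/dev/null', 'w') as devnull:
        rc = subprocess.call(['hadoop', 'dfsadmin', '-report'],
                             stdout=devnull, stderr=subprocess.STDOUT)
    return rc == 0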
Reporter
Comment 6 • 14 years ago
Last Reports Update on tp-socorro01-master01.phx1 dropped for a short while after deinspanjer's fix before trending up again. It's now at ~11300; looping in the Socorro devs before this goes critical.
Reporter
Comment 7 • 14 years ago
Left a voicemail on Laura's phone. Bright side - last_record_reports is at 10603.00
Comment 8 • 14 years ago
Still not right. Looking at, e.g., the crashmover logs, I can see that Thrift connections sometimes succeed, but mostly only on a retry. There are a lot of these errors (example below); more connections appear to be failing than succeeding, but some do succeed.
2011-08-19 05:02:15,661 DEBUG - Thread-4 - Thread-4 - retry_wrapper: handled exception, timed out
2011-08-19 05:02:15,661 ERROR - Thread-4 - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'>
2011-08-19 05:02:15,662 ERROR - Thread-4 - the connection is not viable. retries fail:
2011-08-19 05:02:15,662 ERROR - Thread-4 - trace back follows:
2011-08-19 05:02:15,663 ERROR - Thread-4 - Traceback (most recent call last):
2011-08-19 05:02:15,664 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/crashstorage.py", line 301, in save_raw
self.hbaseConnection.put_json_dump(uuid, jsonData, dump, number_of_retries=2)
2011-08-19 05:02:15,664 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/hbaseClient.py", line 144, in f
result = fn(self, *args, **kwargs)
2011-08-19 05:02:15,665 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/hbaseClient.py", line 685, in put_json_dump
self.client.mutateRow('crash_reports', row_id, mutationList) # unit test marker 233
2011-08-19 05:02:15,665 ERROR - Thread-4 - File "/data/socorro/thirdparty/hbase/hbase.py", line 1251, in mutateRow
self.recv_mutateRow()
2011-08-19 05:02:15,666 ERROR - Thread-4 - File "/data/socorro/thirdparty/hbase/hbase.py", line 1264, in recv_mutateRow
(fname, mtype, rseqid) = self._iprot.readMessageBegin()
2011-08-19 05:02:15,666 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
sz = self.readI32()
2011-08-19 05:02:15,667 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/protocol/TBinaryProtocol.py", line 203, in readI32
buff = self.trans.readAll(4)
2011-08-19 05:02:15,667 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TTransport.py", line 58, in readAll
chunk = self.read(sz-have)
2011-08-19 05:02:15,668 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TTransport.py", line 155, in read
self.__rbuf = StringIO(self.__trans.read(max(sz, self.DEFAULT_BUFFER)))
2011-08-19 05:02:15,668 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TSocket.py", line 92, in read
buff = self.handle.recv(sz)
2011-08-19 05:02:15,669 ERROR - Thread-4 - FatalException: the connection is not viable. retries fail:
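The frames above pass through hbaseClient.py's retry decorator (the f wrapper at line 144). A simplified sketch of that pattern, reconstructed from the logged strings; the reconnect helpers and logger attribute are assumptions, not Socorro's actual code:

import functools
import socket

class FatalException(Exception):
    pass

def retry_wrapper(fn):
    @functools.wraps(fn)
    def f(self, *args, **kwargs):
        number_of_retries = kwargs.pop('number_of_retries', 1)
        for attempt in range(number_of_retries + 1):
            try:
                return fn(self, *args, **kwargs)
            except socket.timeout:
                # matches the logged "retry_wrapper: handled exception, timed out"
                self.logger.debug('retry_wrapper: handled exception, timed out')
                self.close()            # hypothetical: drop the dead transport
                self.make_connection()  # hypothetical: reopen before retrying
        # every attempt timed out
        raise FatalException('the connection is not viable. retries fail:')
    return f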
Updated • 14 years ago
Severity: major → critical
Reporter
Updated • 14 years ago
Group: infra
Comment 11 • 14 years ago
tmary, any updates?
Assignee
Comment 12 • 14 years ago
Some regions in HBase are still down. HBase devs / Cloudera are trying to come up with a temporary fix to get the regions back online.
Until a fix is in place and working, it is difficult to estimate an ETA.
--
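For anyone following along, the standard consistency checker in that era of HBase was `hbase hbck`; a rough sketch of pulling the problem lines out of its output (illustrative only; assumes the hbase CLI on a cluster node, Python 2):

import subprocess

def hbck_problems():
    # hbck prints ERROR lines per problem plus a final "Status: INCONSISTENT"
    # (or "Status: OK") summary; surface just the interesting lines.
    out = subprocess.Popen(['hbase', 'hbck'],
                           stdout=subprocess.PIPE,
                           stderr=subprocess.STDOUT).communicate()[0]
    return [line for line in out.splitlines()
            if 'ERROR' in line or 'INCONSISTENT' in line]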
Assignee
Comment 13 • 14 years ago
Apart from the affected data (regions corresponding to data from 18/19-Aug-2011), the service is up and available. As for the affected regions, Cloudera support and Michael Stack are working on fixing them; unfortunately, there is no ETA.
--
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 14 • 14 years ago
I have a hacking session planned with Stack and Cloudera today at 3 Pacific.
Updated • 10 years ago
Product: mozilla.org → mozilla.org Graveyard