Socorro Thrift Prod connection not viable

Status: RESOLVED FIXED
Severity: critical
Opened: 7 years ago
Last updated: 4 years ago
Reporter: ashish
Assignee: tmary

(Reporter)

Description

7 years ago
21:52:27 < nagios-phx1> [121] sp-admin01.phx1:Thrift - socorro-thrift1.zlb.phx1.mozilla.com is CRITICAL: CRITICAL: Thrift connection to socorro-thrift1.zlb.phx1.mozilla.com is not viable

Per mana, HBase may need to be looked into. However, there were no connection alerts from the processors preceding this.
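For context, the Nagios probe above is reporting that it cannot establish a usable connection to the Thrift endpoint. A minimal sketch of such a check, assuming a plain TCP connect is enough to decide "viable" (the function name and the 9090 default, HBase Thrift's usual port, are illustrative, not the real check's code):

```python
import socket

def thrift_port_viable(host, port=9090, timeout=5.0):
    """Return True if a TCP connection to the Thrift port can be
    established within `timeout` seconds, else False.

    A crude stand-in for the Nagios probe: it only proves the port
    accepts connections, not that the Thrift service behind it is
    actually healthy.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False
```

A real check would additionally issue a cheap Thrift call and verify the response, since a load balancer can accept the TCP handshake even when the backend is down.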
(Reporter)

Comment 1

7 years ago
22:43:22 < nagios-phx1> [192] tp-socorro01-master01.phx1:PostgreSQL Last Reports Update is WARNING: CHECKGANGLIA WARNING: last_record_reports is 3947.00

^^ is quite likely because of this

Called tmary a few times but no answer :(
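The last_record_reports value in the WARNING above reads as seconds since the newest processed report, so a growing number means the processors have stopped writing. A sketch of that interpretation, with made-up thresholds (the real Ganglia check's thresholds and exact semantics are assumptions here):

```python
from datetime import datetime, timezone

def last_record_lag_seconds(newest_report_time, now=None):
    """Seconds elapsed since the newest processed report.

    Hypothetical reading of the last_record_reports metric: the lag
    grows when the processors stop writing new reports, e.g. because
    the HBase/Thrift path is down.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - newest_report_time).total_seconds()

def check_status(lag, warn=3600, crit=14400):
    """Map a lag in seconds onto Nagios-style states
    (thresholds invented for illustration)."""
    if lag >= crit:
        return "CRITICAL"
    if lag >= warn:
        return "WARNING"
    return "OK"
```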
(Reporter)

Comment 2

7 years ago
Unable to wake up deinspanjer or xstevens either ;(

Comment 3

7 years ago
rhelmer thinks this needs action ASAP rather than later. Have asked ashish to page cshields.

Comment 4

7 years ago
(In reply to Shyam Mani [:fox2mike] from comment #3)
> rhelmer thinks this needs action ASAP rather than later. Have asked ashish
> to page cshields.

Daniel is online, so didn't page Corey.
Severity: major → critical
Comment 5

7 years ago
The cluster shut down because the namenode service suddenly stopped.  I have looked a bit to try to figure out why and haven't seen a cause yet.

I restarted the NN, then restarted HBase RegionServers until they were all steady-on and reloading regions.

The cluster seems to be coming back up fine.  Need tmary to do some investigation as to the cause of the failure.
Assignee: server-ops → tmeyarivan
Severity: critical → major
(Reporter)

Comment 6

7 years ago
Last Reports Update on tp-socorro01-master01.phx1 dropped for a short while before trending up again after deinspanjer's fix. It's now at ~11300; looping in Socorro devs before this goes critical.
(Reporter)

Comment 7

7 years ago
Left a voicemail on Laura's phone. Bright side: last_record_reports is down to 10603.00

Comment 8

7 years ago
Still not right.  Looking at e.g. the crashmover logs, I can see that Thrift connections sometimes succeed, but mostly only on a retry.  There are a lot of these errors (sample below).  I would say more connections are failing than succeeding, but some are succeeding.

2011-08-19 05:02:15,661 DEBUG - Thread-4 - Thread-4 - retry_wrapper: handled exception, timed out
2011-08-19 05:02:15,661 ERROR - Thread-4 - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'>
2011-08-19 05:02:15,662 ERROR - Thread-4 - the connection is not viable.  retries fail: 
2011-08-19 05:02:15,662 ERROR - Thread-4 - trace back follows:
2011-08-19 05:02:15,663 ERROR - Thread-4 - Traceback (most recent call last):
2011-08-19 05:02:15,664 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/crashstorage.py", line 301, in save_raw
    self.hbaseConnection.put_json_dump(uuid, jsonData, dump, number_of_retries=2)
2011-08-19 05:02:15,664 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/hbaseClient.py", line 144, in f
    result = fn(self, *args, **kwargs)
2011-08-19 05:02:15,665 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/hbaseClient.py", line 685, in put_json_dump
    self.client.mutateRow('crash_reports', row_id, mutationList) # unit test marker 233
2011-08-19 05:02:15,665 ERROR - Thread-4 - File "/data/socorro/thirdparty/hbase/hbase.py", line 1251, in mutateRow
    self.recv_mutateRow()
2011-08-19 05:02:15,666 ERROR - Thread-4 - File "/data/socorro/thirdparty/hbase/hbase.py", line 1264, in recv_mutateRow
    (fname, mtype, rseqid) = self._iprot.readMessageBegin()
2011-08-19 05:02:15,666 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
    sz = self.readI32()
2011-08-19 05:02:15,667 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/protocol/TBinaryProtocol.py", line 203, in readI32
    buff = self.trans.readAll(4)
2011-08-19 05:02:15,667 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TTransport.py", line 58, in readAll
    chunk = self.read(sz-have)
2011-08-19 05:02:15,668 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TTransport.py", line 155, in read
    self.__rbuf = StringIO(self.__trans.read(max(sz, self.DEFAULT_BUFFER)))
2011-08-19 05:02:15,668 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TSocket.py", line 92, in read
    buff = self.handle.recv(sz)
2011-08-19 05:02:15,669 ERROR - Thread-4 - FatalException: the connection is not viable.  retries fail:
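The traceback above passes through a retry wrapper in hbaseClient.py before surfacing FatalException. A sketch of the behavior visible in this log, assuming a decorator that handles each retriable error, retries the call, and raises FatalException once retries are exhausted (the real hbaseClient also re-establishes the Thrift connection between attempts, which is omitted here):

```python
import functools
import socket

class FatalException(Exception):
    """Raised once all retries are exhausted (named after
    socorro.storage.hbaseClient.FatalException in the traceback)."""

def retry_wrapper(number_of_retries=2, retriable=(socket.timeout, IOError)):
    """Decorator mirroring the log: each retriable error is handled
    ("retry_wrapper: handled exception, timed out") and the call is
    retried; when retries run out, FatalException is raised with the
    "connection is not viable" message."""
    def decorator(fn):
        @functools.wraps(fn)
        def f(*args, **kwargs):
            last_error = None
            for _ in range(number_of_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except retriable as e:
                    last_error = e  # handled; try again
            raise FatalException(
                "the connection is not viable.  retries fail: %r" % last_error)
        return f
    return decorator
```

With `number_of_retries=2` (the value passed by `save_raw` in the traceback), a call gets three attempts in total; the log pattern "sometimes succeed but mostly only on a retry" matches intermittent failures inside that window.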

Updated

7 years ago
Severity: major → critical

Updated

7 years ago
Duplicate of this bug: 670243

Updated

7 years ago
Duplicate of this bug: 680421
(Reporter)

Updated

7 years ago
Group: infra
tmary, any updates?
(Assignee)

Comment 12

7 years ago
Some regions in HBase are still down. HBase devs / Cloudera are trying to come up with a temporary fix to get the regions back online.

Until a fix works, it is difficult to estimate an ETA.

--
(Assignee)

Comment 13

7 years ago
Apart from the affected data (regions corresponding to data from 18/19-Aug-2011), the service is up and available. Re the affected regions, Cloudera support and Michael Stack are working on fixing them - unfortunately, no ETA.

--
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
I have a hacking session planned with Stack and Cloudera today at 3 Pacific.
Product: mozilla.org → mozilla.org Graveyard