Socorro Thrift Prod connection not viable

Status: RESOLVED FIXED
Severity: critical
Opened: 7 years ago
Last updated: 4 years ago
Reporter: ashish
Assignee: tmary

(Reporter)

Description

7 years ago
21:52:27 < nagios-phx1> [121] sp-admin01.phx1:Thrift - socorro-thrift1.zlb.phx1.mozilla.com is CRITICAL: CRITICAL: Thrift connection to socorro-thrift1.zlb.phx1.mozilla.com is not viable

Per mana, HBase may need to be looked into. However, there were no connection alerts from the processors preceding this.
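For context, the Nagios probe above is reporting that it cannot establish a usable connection to the Thrift endpoint. A minimal sketch of such a check, assuming a plain TCP connect is enough to decide "viable" (the function name and the 9090 default, HBase Thrift's usual port, are illustrative, not the real check's code):

```python
import socket

def thrift_port_viable(host, port=9090, timeout=5.0):
    """Return True if a TCP connection to the Thrift port can be
    established within `timeout` seconds, else False.

    A crude stand-in for the Nagios probe: it only proves the port
    accepts connections, not that the Thrift service behind it is
    actually healthy.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False
```

A real check would additionally issue a cheap Thrift call and verify the response, since a load balancer can accept the TCP handshake even when the backend is down.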
(Reporter)

Comment 1

7 years ago
22:43:22 < nagios-phx1> [192] tp-socorro01-master01.phx1:PostgreSQL Last Reports Update is WARNING: CHECKGANGLIA WARNING: last_record_reports is 3947.00

^^ is quite likely because of this

Called tmary a few times but no answer :(
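The last_record_reports value in the WARNING above reads as seconds since the newest processed report, so a growing number means the processors have stopped writing. A sketch of that interpretation, with made-up thresholds (the real Ganglia check's thresholds and exact semantics are assumptions here):

```python
from datetime import datetime, timezone

def last_record_lag_seconds(newest_report_time, now=None):
    """Seconds elapsed since the newest processed report.

    Hypothetical reading of the last_record_reports metric: the lag
    grows when the processors stop writing new reports, e.g. because
    the HBase/Thrift path is down.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - newest_report_time).total_seconds()

def check_status(lag, warn=3600, crit=14400):
    """Map a lag in seconds onto Nagios-style states
    (thresholds invented for illustration)."""
    if lag >= crit:
        return "CRITICAL"
    if lag >= warn:
        return "WARNING"
    return "OK"
```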
(Reporter)

Comment 2

7 years ago
Unable to wake up deinspanjer or xstevens either ;(

Comment 3

7 years ago
rhelmer thinks this needs action ASAP rather than later. Have asked ashish to page cshields.

Comment 4

7 years ago
(In reply to Shyam Mani [:fox2mike] from comment #3)
> rhelmer thinks this needs action ASAP rather than later. Have asked ashish
> to page cshields.

Daniel is online, so didn't page Corey.
Severity: major → critical
Comment 5

7 years ago
The cluster shut down because the namenode service suddenly stopped.  I have looked a bit to try to figure out why and haven't seen a cause yet.

I restarted the NN, then restarted HBase RegionServers until they were all steady-on and reloading regions.

The cluster seems to be coming back up fine.  Need tmary to do some investigation as to the cause of the failure.
Assignee: server-ops → tmeyarivan
Severity: critical → major
(Reporter)

Comment 6

7 years ago
Last Reports Update on tp-socorro01-master01.phx1 dropped for a short while before trending up again after deinspanjer's fix. It's now at ~11300; looping in Socorro devs before this goes critical.
(Reporter)

Comment 7

7 years ago
Left a voicemail on Laura's phone. Bright side: last_record_reports is down to 10603.00

Comment 8

7 years ago
Still not right.  Looking at e.g. the crashmover logs, I can see that Thrift connections sometimes succeed, but mostly only on a retry.  There are a lot of these errors (sample below).  I would say more connections are failing than succeeding, but some are succeeding.

2011-08-19 05:02:15,661 DEBUG - Thread-4 - Thread-4 - retry_wrapper: handled exception, timed out
2011-08-19 05:02:15,661 ERROR - Thread-4 - Caught Error: <class 'socorro.storage.hbaseClient.FatalException'>
2011-08-19 05:02:15,662 ERROR - Thread-4 - the connection is not viable.  retries fail: 
2011-08-19 05:02:15,662 ERROR - Thread-4 - trace back follows:
2011-08-19 05:02:15,663 ERROR - Thread-4 - Traceback (most recent call last):
2011-08-19 05:02:15,664 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/crashstorage.py", line 301, in save_raw
    self.hbaseConnection.put_json_dump(uuid, jsonData, dump, number_of_retries=2)
2011-08-19 05:02:15,664 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/hbaseClient.py", line 144, in f
    result = fn(self, *args, **kwargs)
2011-08-19 05:02:15,665 ERROR - Thread-4 - File "/data/socorro/application/socorro/storage/hbaseClient.py", line 685, in put_json_dump
    self.client.mutateRow('crash_reports', row_id, mutationList) # unit test marker 233
2011-08-19 05:02:15,665 ERROR - Thread-4 - File "/data/socorro/thirdparty/hbase/hbase.py", line 1251, in mutateRow
    self.recv_mutateRow()
2011-08-19 05:02:15,666 ERROR - Thread-4 - File "/data/socorro/thirdparty/hbase/hbase.py", line 1264, in recv_mutateRow
    (fname, mtype, rseqid) = self._iprot.readMessageBegin()
2011-08-19 05:02:15,666 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
    sz = self.readI32()
2011-08-19 05:02:15,667 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/protocol/TBinaryProtocol.py", line 203, in readI32
    buff = self.trans.readAll(4)
2011-08-19 05:02:15,667 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TTransport.py", line 58, in readAll
    chunk = self.read(sz-have)
2011-08-19 05:02:15,668 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TTransport.py", line 155, in read
    self.__rbuf = StringIO(self.__trans.read(max(sz, self.DEFAULT_BUFFER)))
2011-08-19 05:02:15,668 ERROR - Thread-4 - File "/data/socorro/thirdparty/thrift/transport/TSocket.py", line 92, in read
    buff = self.handle.recv(sz)
2011-08-19 05:02:15,669 ERROR - Thread-4 - FatalException: the connection is not viable.  retries fail:
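The traceback above passes through a retry wrapper in hbaseClient.py before surfacing FatalException. A sketch of the behavior visible in this log, assuming a decorator that handles each retriable error, retries the call, and raises FatalException once retries are exhausted (the real hbaseClient also re-establishes the Thrift connection between attempts, which is omitted here):

```python
import functools
import socket

class FatalException(Exception):
    """Raised once all retries are exhausted (named after
    socorro.storage.hbaseClient.FatalException in the traceback)."""

def retry_wrapper(number_of_retries=2, retriable=(socket.timeout, IOError)):
    """Decorator mirroring the log: each retriable error is handled
    ("retry_wrapper: handled exception, timed out") and the call is
    retried; when retries run out, FatalException is raised with the
    "connection is not viable" message."""
    def decorator(fn):
        @functools.wraps(fn)
        def f(*args, **kwargs):
            last_error = None
            for _ in range(number_of_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except retriable as e:
                    last_error = e  # handled; try again
            raise FatalException(
                "the connection is not viable.  retries fail: %r" % last_error)
        return f
    return decorator
```

With `number_of_retries=2` (the value passed by `save_raw` in the traceback), a call gets three attempts in total; the log pattern "sometimes succeed but mostly only on a retry" matches intermittent failures inside that window.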

Updated

7 years ago
Severity: major → critical

Updated

7 years ago
Duplicate of this bug: 670243

Updated

7 years ago
Duplicate of this bug: 680421
(Reporter)

Updated

7 years ago
Group: infra
tmary, any updates?
(Assignee)

Comment 12

7 years ago
Some regions in HBase are still down. HBase devs / Cloudera are trying to come up with a temporary fix to get the regions back online.

Until a fix works, it is difficult to estimate an ETA.

--
(Assignee)

Comment 13

7 years ago
Apart from the affected data (regions corresponding to data from 18/19-Aug-2011), the service is up and available. Re the affected regions, Cloudera support and Michael Stack are working on fixing them - unfortunately, no ETA.

--
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
I have a hacking session planned with Stack and Cloudera today at 3 Pacific.
Product: mozilla.org → mozilla.org Graveyard