Closed Bug 626020 Opened 15 years ago Closed 15 years ago

Production Socorro Monitor not able to connect to HBase

Categories

(Socorro :: General, task)

x86_64
Linux
task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Unassigned)

Details

Monitor in MPT is not able to connect to HBase following the recent outage (bug 626007). Tried restarting. Here is a traceback: 2011-01-15 00:05:14,792 DEBUG - MainThread - creating crashStorePool 2011-01-15 00:05:14,792 INFO - priorityLoopingThread - priorityJobAllocationLoop starting. 2011-01-15 00:05:14,792 INFO - jobCleanupThread - jobCleanupLoop starting. 2011-01-15 00:05:14,792 INFO - jobCleanupThread - beginning jobCleanupLoop cycle. 2011-01-15 00:05:14,793 DEBUG - jobCleanupThread - dealing with completed and failed jobs 2011-01-15 00:05:14,793 DEBUG - MainThread - creating crashStore for MainThread 2011-01-15 00:05:14,793 INFO - connecting to hbase 2011-01-15 00:05:14,793 DEBUG - make_connection, timeout = 5000 2011-01-15 00:05:14,796 DEBUG - connection successful 2011-01-15 00:05:14,801 DEBUG - priorityLoopingThread - sleeping 2011-01-15 00:05:14,803 DEBUG - jobCleanupThread - starting deletion 2011-01-15 00:05:14,804 INFO - MainThread - looking for dead processors 2011-01-15 00:05:14,804 INFO - MainThread - threshold 0:15:00 2011-01-15 00:05:14,808 DEBUG - jobCleanupThread - end of this cleanup iteration 2011-01-15 00:05:14,810 INFO - MainThread - dead processors: [] 2011-01-15 00:05:14,810 DEBUG - MainThread - getting jobSchedulerIter 2011-01-15 00:05:14,810 DEBUG - MainThread - beginning index scan 2011-01-15 00:05:14,810 DEBUG - MainThread - starting destructiveDateWalk 2011-01-15 00:05:14,810 DEBUG - MainThread - retry_wrapper_for_generators: trying first time, iterator_for_all_legacy_to_be_processed 2011-01-15 00:05:14,810 DEBUG - iterator_for_all_legacy_to_be_processed 2011-01-15 00:05:14,810 DEBUG - limit = 1000000 2011-01-15 00:05:14,859 DEBUG - Scanner 0 generated 2011-01-15 00:05:17,805 DEBUG - Scanner 1 generated 2011-01-15 00:05:20,481 DEBUG - Scanner 2 generated 2011-01-15 00:05:23,057 DEBUG - Scanner 3 generated 2011-01-15 00:05:25,853 DEBUG - Scanner 4 generated 2011-01-15 00:05:30,852 DEBUG - MainThread - retry_wrapper_for_generators: handled exception, threading.currentThread().getName(), timed out 2011-01-15 00:05:30,852 DEBUG - MainThread - retry_wrapper_for_generators: about to retry connection 2011-01-15 00:05:30,853 DEBUG - make_connection, timeout = 5000 2011-01-15 00:05:30,854 DEBUG - connection successful 2011-01-15 00:05:30,854 DEBUG - MainThread - retry_wrapper_for_generators: about to retry function, iterator_for_all_legacy_to_be_processed 2011-01-15 00:05:30,854 DEBUG - iterator_for_all_legacy_to_be_processed 2011-01-15 00:05:30,855 DEBUG - limit = 1000000 2011-01-15 00:05:30,892 DEBUG - Scanner 0 generated 2011-01-15 00:05:33,989 DEBUG - Scanner 1 generated 2011-01-15 00:05:36,717 DEBUG - Scanner 2 generated 2011-01-15 00:05:39,313 DEBUG - Scanner 3 generated 2011-01-15 00:05:42,384 DEBUG - Scanner 4 generated 2011-01-15 00:05:45,318 DEBUG - Scanner 5 generated 2011-01-15 00:05:48,259 DEBUG - Scanner 6 generated 2011-01-15 00:05:50,741 DEBUG - Scanner 7 generated 2011-01-15 00:05:55,740 DEBUG - MainThread - retry_wrapper_for_generators: handled exception, threading.currentThread().getName(), timed out 2011-01-15 00:05:55,740 ERROR - MainThread - retry_wrapper_for_generators: failed too many times on this one operation, iterator_for_all_legacy_to_be_processed 2011-01-15 00:05:55,741 DEBUG - MainThread - somethings gone horribly wrong with HBase 2011-01-15 00:05:55,741 CRITICAL - MainThread Caught Error: socorro.hbase.hbaseClient.FatalException 2011-01-15 00:05:55,741 CRITICAL - the connection is not viable. retries fail: 2011-01-15 00:05:55,741 CRITICAL - trace back follows: File "/data/breakpad/processor/socorro/monitor/monitor.py", line 353, in standardJobAllocationLoop for uuid in crashStorage.newUuids(): File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 106, in f for x in fn(self, *args, **kwargs): File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 573, in iterator_for_all_legacy_to_be_processed ['ids:ooid'])): File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 546, in merge_scan_with_prefix next_items_queue.append([next(),i,next]) File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 745, in salted_scanner_iterable raw_rows = client.scannerGet(scanner) File "/data/breakpad/processor_1/thirdparty/hbase/hbase.py", line 1892, in scannerGet return self.recv_scannerGet() File "/data/breakpad/processor_1/thirdparty/hbase/hbase.py", line 1903, in recv_scannerGet (fname, mtype, rseqid) = self._iprot.readMessageBegin() File "/data/breakpad/processor_1/thirdparty/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin sz = self.readI32() File "/data/breakpad/processor_1/thirdparty/thrift/protocol/TBinaryProtocol.py", line 203, in readI32 buff = self.trans.readAll(4) File "/data/breakpad/processor_1/thirdparty/thrift/transport/TTransport.py", line 58, in readAll chunk = self.read(sz-have) File "/data/breakpad/processor_1/thirdparty/thrift/transport/TTransport.py", line 155, in read self.__rbuf = StringIO(self.__trans.read(max(sz, self.DEFAULT_BUFFER))) File "/data/breakpad/processor_1/thirdparty/thrift/transport/TSocket.py", line 92, in read buff = self.handle.recv(sz) 2011-01-15 00:05:55,742 DEBUG - MainThread - standardLoop done. 2011-01-15 00:05:55,742 DEBUG - MainThread - waiting to join. 2011-01-15 00:05:55,801 DEBUG - priorityLoopingThread - outer detects quit 2011-01-15 00:05:55,802 INFO - priorityLoopingThread - priorityLoop done. 2011-01-15 00:05:55,807 DEBUG - jobCleanupThread - got quit message 2011-01-15 00:05:55,807 INFO - jobCleanupThread - jobCleanupLoop done. 2011-01-15 00:05:55,808 DEBUG - MainThread - calling databaseConnectionPool.cleanup(). 2011-01-15 00:05:55,808 DEBUG - MainThread - killing database connections 2011-01-15 00:05:55,808 DEBUG - MainThread - connection MainThread already closed 2011-01-15 00:05:55,808 DEBUG - MainThread - connection jobCleanupThread closed 2011-01-15 00:05:55,808 DEBUG - MainThread - connection priorityLoopingThread closed 2011-01-15 00:05:55,808 DEBUG - MainThread - crashStore MainThread closed
We can defnitely talk to hbase from the monitor.. [processor@cm-breakpad03 ~]$ telnet 10.2.72.247 9090 Trying 10.2.72.247... Connected to thrift-socorro.internal.acelb.mozilla.org (10.2.72.247). Escape character is '^]'. ^] telnet> q Connection closed. [processor@cm-breakpad03 ~]$ The best I can tell is that the monitor seems to be timing out generating these crash lists from hbase to process.
That is an accurate assessment of what the problem was. I ran the merge scan hbaseClient method from the command line and it timed out. On the server side, it returned an extremely puzzling error about an "UnknownScanner". For some reason, one of the 16 scanners the merge scan generates was getting lost somewhere in the middle of things. We need to do more research on what exactly caused this error, but in the interest of trying to get things back up and running, I attempted to perform a major compact of the legacy unprocessed queue table. This took a few seconds and after completion, the merge scan began working properly. I can see that the master is pulling entries from the queue now. Watching the crash-stats status page, the number of pending jobs climbed up to 49k but then started slowly lowering so it looks like the processors are chewing through the backlog.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.