Closed
Bug 626020
Opened 15 years ago
Closed 15 years ago
Production Socorro Monitor not able to connect to HBase
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rhelmer, Unassigned)
Details
Monitor in MPT is not able to connect to HBase following the recent outage (bug 626007). Tried restarting. Here is a traceback:
2011-01-15 00:05:14,792 DEBUG - MainThread - creating crashStorePool
2011-01-15 00:05:14,792 INFO - priorityLoopingThread - priorityJobAllocationLoop starting.
2011-01-15 00:05:14,792 INFO - jobCleanupThread - jobCleanupLoop starting.
2011-01-15 00:05:14,792 INFO - jobCleanupThread - beginning jobCleanupLoop cycle.
2011-01-15 00:05:14,793 DEBUG - jobCleanupThread - dealing with completed and failed jobs
2011-01-15 00:05:14,793 DEBUG - MainThread - creating crashStore for MainThread
2011-01-15 00:05:14,793 INFO - connecting to hbase
2011-01-15 00:05:14,793 DEBUG - make_connection, timeout = 5000
2011-01-15 00:05:14,796 DEBUG - connection successful
2011-01-15 00:05:14,801 DEBUG - priorityLoopingThread - sleeping
2011-01-15 00:05:14,803 DEBUG - jobCleanupThread - starting deletion
2011-01-15 00:05:14,804 INFO - MainThread - looking for dead processors
2011-01-15 00:05:14,804 INFO - MainThread - threshold 0:15:00
2011-01-15 00:05:14,808 DEBUG - jobCleanupThread - end of this cleanup iteration
2011-01-15 00:05:14,810 INFO - MainThread - dead processors: []
2011-01-15 00:05:14,810 DEBUG - MainThread - getting jobSchedulerIter
2011-01-15 00:05:14,810 DEBUG - MainThread - beginning index scan
2011-01-15 00:05:14,810 DEBUG - MainThread - starting destructiveDateWalk
2011-01-15 00:05:14,810 DEBUG - MainThread - retry_wrapper_for_generators: trying first time, iterator_for_all_legacy_to_be_processed
2011-01-15 00:05:14,810 DEBUG - iterator_for_all_legacy_to_be_processed
2011-01-15 00:05:14,810 DEBUG - limit = 1000000
2011-01-15 00:05:14,859 DEBUG - Scanner 0 generated
2011-01-15 00:05:17,805 DEBUG - Scanner 1 generated
2011-01-15 00:05:20,481 DEBUG - Scanner 2 generated
2011-01-15 00:05:23,057 DEBUG - Scanner 3 generated
2011-01-15 00:05:25,853 DEBUG - Scanner 4 generated
2011-01-15 00:05:30,852 DEBUG - MainThread - retry_wrapper_for_generators: handled exception, threading.currentThread().getName(), timed out
2011-01-15 00:05:30,852 DEBUG - MainThread - retry_wrapper_for_generators: about to retry connection
2011-01-15 00:05:30,853 DEBUG - make_connection, timeout = 5000
2011-01-15 00:05:30,854 DEBUG - connection successful
2011-01-15 00:05:30,854 DEBUG - MainThread - retry_wrapper_for_generators: about to retry function, iterator_for_all_legacy_to_be_processed
2011-01-15 00:05:30,854 DEBUG - iterator_for_all_legacy_to_be_processed
2011-01-15 00:05:30,855 DEBUG - limit = 1000000
2011-01-15 00:05:30,892 DEBUG - Scanner 0 generated
2011-01-15 00:05:33,989 DEBUG - Scanner 1 generated
2011-01-15 00:05:36,717 DEBUG - Scanner 2 generated
2011-01-15 00:05:39,313 DEBUG - Scanner 3 generated
2011-01-15 00:05:42,384 DEBUG - Scanner 4 generated
2011-01-15 00:05:45,318 DEBUG - Scanner 5 generated
2011-01-15 00:05:48,259 DEBUG - Scanner 6 generated
2011-01-15 00:05:50,741 DEBUG - Scanner 7 generated
2011-01-15 00:05:55,740 DEBUG - MainThread - retry_wrapper_for_generators: handled exception, threading.currentThread().getName(), timed out
2011-01-15 00:05:55,740 ERROR - MainThread - retry_wrapper_for_generators: failed too many times on this one operation, iterator_for_all_legacy_to_be_processed
2011-01-15 00:05:55,741 DEBUG - MainThread - somethings gone horribly wrong with HBase
2011-01-15 00:05:55,741 CRITICAL - MainThread Caught Error: socorro.hbase.hbaseClient.FatalException
2011-01-15 00:05:55,741 CRITICAL - the connection is not viable. retries fail:
2011-01-15 00:05:55,741 CRITICAL - trace back follows:
File "/data/breakpad/processor/socorro/monitor/monitor.py", line 353, in standardJobAllocationLoop
for uuid in crashStorage.newUuids():
File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 106, in f
for x in fn(self, *args, **kwargs):
File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 573, in iterator_for_all_legacy_to_be_processed
['ids:ooid'])):
File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 546, in merge_scan_with_prefix
next_items_queue.append([next(),i,next])
File "/data/breakpad/processor/socorro/hbase/hbaseClient.py", line 745, in salted_scanner_iterable
raw_rows = client.scannerGet(scanner)
File "/data/breakpad/processor_1/thirdparty/hbase/hbase.py", line 1892, in scannerGet
return self.recv_scannerGet()
File "/data/breakpad/processor_1/thirdparty/hbase/hbase.py", line 1903, in recv_scannerGet
(fname, mtype, rseqid) = self._iprot.readMessageBegin()
File "/data/breakpad/processor_1/thirdparty/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
sz = self.readI32()
File "/data/breakpad/processor_1/thirdparty/thrift/protocol/TBinaryProtocol.py", line 203, in readI32
buff = self.trans.readAll(4)
File "/data/breakpad/processor_1/thirdparty/thrift/transport/TTransport.py", line 58, in readAll
chunk = self.read(sz-have)
File "/data/breakpad/processor_1/thirdparty/thrift/transport/TTransport.py", line 155, in read
self.__rbuf = StringIO(self.__trans.read(max(sz, self.DEFAULT_BUFFER)))
File "/data/breakpad/processor_1/thirdparty/thrift/transport/TSocket.py", line 92, in read
buff = self.handle.recv(sz)
2011-01-15 00:05:55,742 DEBUG - MainThread - standardLoop done.
2011-01-15 00:05:55,742 DEBUG - MainThread - waiting to join.
2011-01-15 00:05:55,801 DEBUG - priorityLoopingThread - outer detects quit
2011-01-15 00:05:55,802 INFO - priorityLoopingThread - priorityLoop done.
2011-01-15 00:05:55,807 DEBUG - jobCleanupThread - got quit message
2011-01-15 00:05:55,807 INFO - jobCleanupThread - jobCleanupLoop done.
2011-01-15 00:05:55,808 DEBUG - MainThread - calling databaseConnectionPool.cleanup().
2011-01-15 00:05:55,808 DEBUG - MainThread - killing database connections
2011-01-15 00:05:55,808 DEBUG - MainThread - connection MainThread already closed
2011-01-15 00:05:55,808 DEBUG - MainThread - connection jobCleanupThread closed
2011-01-15 00:05:55,808 DEBUG - MainThread - connection priorityLoopingThread closed
2011-01-15 00:05:55,808 DEBUG - MainThread - crashStore MainThread closed
Comment 1•15 years ago
|
||
We can defnitely talk to hbase from the monitor..
[processor@cm-breakpad03 ~]$ telnet 10.2.72.247 9090
Trying 10.2.72.247...
Connected to thrift-socorro.internal.acelb.mozilla.org (10.2.72.247).
Escape character is '^]'.
^]
telnet> q
Connection closed.
[processor@cm-breakpad03 ~]$
The best I can tell is that the monitor seems to be timing out generating these crash lists from hbase to process.
Comment 2•15 years ago
|
||
That is an accurate assessment of what the problem was.
I ran the merge scan hbaseClient method from the command line and it timed out. On the server side, it returned an extremely puzzling error about an "UnknownScanner". For some reason, one of the 16 scanners the merge scan generates was getting lost somewhere in the middle of things.
We need to do more research on what exactly caused this error, but in the interest of trying to get things back up and running, I attempted to perform a major compact of the legacy unprocessed queue table. This took a few seconds and after completion, the merge scan began working properly.
I can see that the master is pulling entries from the queue now. Watching the crash-stats status page, the number of pending jobs climbed up to 49k but then started slowly lowering so it looks like the processors are chewing through the backlog.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
| Assignee | ||
Updated•14 years ago
|
Component: Socorro → General
Product: Webtools → Socorro
You need to log in
before you can comment on or make changes to this bug.
Description
•