Bug 599947
Opened 14 years ago
Closed 14 years ago
Very slow requests to Socorro HBase cluster causing client side failures
Categories
(Mozilla Metrics :: Hadoop/HBase Operations, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
Unreviewed
People
(Reporter: dre, Assigned: dre)
Details
Requests to the Socorro HBase cluster take so long that the client times out. Two examples:

  get 'crash_reports', '310092732c9d433-578c-46b4-8a5d-ea0182100927'
  python26 socorro/storage/hbaseClient.py -h cm-hadoop06 -t 60000 merge_scan_with_prefix crash_reports_index_legacy_unprocessed_flag '' ids:ooid 10

No known cause as of yet. We have not seen any missing regions. A get issued from the hbase shell does succeed, but only after about 5 to 10 minutes. No unusual errors so far in the cluster logs.

Running netstat -pnt | awk '{print $6}' | sort | uniq -c shows that most worker nodes have about 8500 connections in ESTABLISHED and about 1000 in CLOSE_WAIT.
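For anyone repeating the connection count on a saved netstat capture, here is a minimal Python sketch of the same tally. It assumes the usual `netstat -pnt` column layout with the state in the sixth field. The function name and the sample rows (addresses, ports, PIDs) are made up for illustration and are not taken from the cluster. Unlike the awk pipeline, it skips the two header lines.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally the State column (6th field) of `netstat -pnt` style output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol name (tcp, tcp6);
        # the two header lines do not, so they are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


# Illustrative sample only; not real output from the cluster.
sample = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.0.0.6:9090           10.0.0.21:51234         ESTABLISHED 4242/java
tcp        0      0 10.0.0.6:9090           10.0.0.22:51301         ESTABLISHED 4242/java
tcp        1      0 10.0.0.6:9090           10.0.0.23:40112         CLOSE_WAIT  4242/java
"""

print(count_tcp_states(sample))
```

A large and growing CLOSE_WAIT count generally means the remote side has closed the connection while the local process has not yet closed its socket, so it is worth tracking next to the ESTABLISHED count.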
Comment 1•14 years ago
On the Socorro side, we've lowered the hbaseTimeout to 500 ms. It seems the majority of reports are failing and going into fallback storage (NFS). Cloudera Support and Stack are on the box trying to debug.
Comment 2•14 years ago (Assignee)
We shut down the thrift servers and after a bit the cluster became responsive again. Next immediate steps are to restart and monitor.
Comment 3•14 years ago
Just to clarify, the fallback storage to which Laura refers is not NFS. It is local temporary storage living on the collector boxes themselves.
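The timeout-then-fallback behaviour described in comments 1 and 3 can be sketched roughly as below. This is not the Socorro collector code. `save_report`, `primary_put`, and the spool directory layout are hypothetical names chosen for illustration, and the sketch assumes the primary client raises `TimeoutError` (a subclass of `OSError`) when the store does not answer in time.

```python
import os
import tempfile


def save_report(ooid, payload, primary_put, fallback_dir):
    """Write to the primary store; on a timeout or I/O error, spool locally.

    primary_put is a hypothetical callable that raises TimeoutError
    (a subclass of OSError) when the store does not answer in time.
    """
    try:
        primary_put(ooid, payload)
        return "primary"
    except OSError:
        # Local spool on the collector itself. Assumption: some separate
        # job replays these files once the cluster is healthy again.
        path = os.path.join(fallback_dir, ooid + ".json")
        with open(path, "w") as handle:
            handle.write(payload)
        return "fallback"


def slow_cluster_put(ooid, payload):
    # Stand-in for a request that exceeds the client timeout.
    raise TimeoutError("no response within 500 ms")


spool_dir = tempfile.mkdtemp()
result = save_report("example-ooid", '{"product": "Firefox"}',
                     slow_cluster_put, spool_dir)
print(result, os.listdir(spool_dir))
```

With a short timeout the collector stays responsive while the cluster is slow, at the cost of sending more reports to the local spool, which matches the behaviour reported in comment 1.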
Comment 4•14 years ago (Assignee)
Restarted, and after adjusting the collector timeouts again, things seem to be up and running smoothly. The failure seems to be related to a failed disk on a node that happened to be hosting a critical piece of HBase metadata at the time. The failed disk didn't cause anything to stop working (which would have kicked the node out of the cluster); instead it just slowed all disk I/O on that machine, which eventually cascaded into slow operations across the cluster.
Updated•14 years ago (Assignee)
Assignee: nobody → deinspanjer
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED