Closed Bug 599947 Opened 14 years ago Closed 14 years ago

Very slow requests to Socorro HBase cluster causing client side failures

Categories

(Mozilla Metrics :: Hadoop/HBase Operations, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED
Unreviewed

People

(Reporter: dre, Assigned: dre)

Details

Requests like 
get 'crash_reports', '310092732c9d433-578c-46b4-8a5d-ea0182100927'
and 
python26 socorro/storage/hbaseClient.py -h cm-hadoop06 -t 60000 merge_scan_with_prefix crash_reports_index_legacy_unprocessed_flag '' ids:ooid 10

take so long the client times out.
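For reference, a minimal sketch of how one might time the same call from a collector box. The command line is copied verbatim from above; the 120-second ceiling is an arbitrary assumption, and this is only a measurement aid, not part of Socorro:

# Time the hbaseClient.py call quoted above; give up past an assumed ceiling.
import subprocess
import time

CMD = ["python26", "socorro/storage/hbaseClient.py",
       "-h", "cm-hadoop06", "-t", "60000",
       "merge_scan_with_prefix",
       "crash_reports_index_legacy_unprocessed_flag", "", "ids:ooid", "10"]
CEILING_SECONDS = 120  # assumption: anything slower than this counts as a failure

start = time.time()
proc = subprocess.Popen(CMD, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
while proc.poll() is None and time.time() - start < CEILING_SECONDS:
    time.sleep(1)
if proc.poll() is None:
    proc.kill()
    print "gave up after %.1f seconds" % (time.time() - start)
else:
    print "completed in %.1f seconds" % (time.time() - start)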

No known cause as of yet.

No missing regions that we've seen. When doing a get from the HBase shell, the requests succeed after about 5 to 10 minutes.

No unusual errors so far in cluster logs.

Ran
 netstat -pnt | awk '{print $6}' | sort | uniq -c

It reports that most worker nodes have about 8500 connections in ESTABLISHED and 1000 in CLOSE_WAIT.
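If the shell one-liner gets awkward to repeat across nodes, roughly the same tally can be done in Python. This is just a sketch, assuming the usual netstat -pnt column layout where the sixth field is the TCP state:

# Tally TCP connection states, equivalent in spirit to the awk pipeline above.
import subprocess
from collections import defaultdict

output = subprocess.Popen(["netstat", "-pnt"],
                          stdout=subprocess.PIPE).communicate()[0]
counts = defaultdict(int)
for line in output.splitlines():
    fields = line.split()
    # tcp lines: proto recv-q send-q local-addr foreign-addr state pid/program
    if len(fields) >= 6 and fields[0].startswith("tcp"):
        counts[fields[5]] += 1
for state, count in sorted(counts.items()):
    print "%8d %s" % (count, state)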
On the Socorro side, we've lowered the hbaseTimeout to 500ms.  It seems the majority of reports are failing out and going into fallback storage (NFS).

Cloudera Support and Stack are on the box and trying to debug.
We shut down the Thrift servers, and after a bit the cluster became responsive again.  The next immediate steps are to restart and monitor.
Just to clarify, the fallback storage to which Laura refers is not NFS.  It is local temporary storage living on the collector boxes themselves.
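For anyone following along, the behavior being described is a simple timeout-then-fallback pattern. The sketch below is only an illustration, not Socorro's actual code: save_to_hbase and save_to_local_fallback are hypothetical stand-ins for the real storage calls, and the exception type is an assumption.

# Illustration of the timeout-then-fallback behavior described above.
# save_to_hbase and save_to_local_fallback are hypothetical placeholders.
import socket

HBASE_TIMEOUT_SECONDS = 0.5  # the lowered 500ms hbaseTimeout mentioned above

def store_crash_report(ooid, dump, save_to_hbase, save_to_local_fallback):
    try:
        save_to_hbase(ooid, dump, timeout=HBASE_TIMEOUT_SECONDS)
    except socket.timeout:
        # HBase is too slow; park the report in local temporary storage on
        # the collector box so it can be replayed into HBase later.
        save_to_local_fallback(ooid, dump)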
Restarted, and after adjusting the collector timeouts again, things seem to be up and running smoothly.

The failure seems to be related to a failed disk on a node that happened to be hosting a critical piece of HBase metadata at the time.  The failed disk didn't cause anything to stop working outright (which would have kicked the node out of the cluster); rather, it just slowed all disk I/O on that machine, which eventually cascaded into slow operations across the cluster.
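As a side note, a disk that is slow but not dead usually shows up as an outsized await value in extended iostat output rather than as outright errors. A rough detection sketch, assuming a sysstat version whose iostat -dx output includes an "await" column; the 100 ms threshold is an arbitrary assumption:

# Flag block devices whose average I/O wait (await, ms) looks pathological.
import subprocess

AWAIT_THRESHOLD_MS = 100.0  # assumption: anything above this is suspicious

output = subprocess.Popen(["iostat", "-dx"],
                          stdout=subprocess.PIPE).communicate()[0]
header = None
for line in output.splitlines():
    fields = line.split()
    if fields and fields[0].rstrip(":") == "Device":
        header = fields  # column names, e.g. Device: ... await ... %util
        continue
    if header and fields and len(fields) == len(header):
        stats = dict(zip(header, fields))
        if "await" in stats and float(stats["await"]) > AWAIT_THRESHOLD_MS:
            print "suspect device %s: await=%s ms" % (fields[0], stats["await"])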
Assignee: nobody → deinspanjer
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED