
Very slow requests to Socorro HBase cluster causing client side failures

RESOLVED FIXED in Unreviewed

Status

Mozilla Metrics
Hadoop/HBase Operations
--
blocker
RESOLVED FIXED
7 years ago
7 years ago

People

(Reporter: dre, Assigned: dre)

Tracking

unspecified
Unreviewed

Details

Requests like 
get 'crash_reports', '310092732c9d433-578c-46b4-8a5d-ea0182100927'
and 
python26 socorro/storage/hbaseClient.py -h cm-hadoop06 -t 60000 merge_scan_with_prefix crash_reports_index_legacy_unprocessed_flag '' ids:ooid 10

take so long the client times out.

No known cause as of yet.

We have not seen any missing regions. When doing a get from the hbase shell, the requests succeed after about 5 to 10 minutes.

No unusual errors so far in cluster logs.

Ran
 netstat -pnt | awk '{print $6}' | sort | uniq -c

It reports that most worker nodes have about 8500 ESTABLISHED connections and 1000 in CLOSE_WAIT.
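For anyone who wants to compare these counts across nodes from a script, here is a rough Python sketch of the same tally. It assumes the usual Linux `netstat -pnt` column layout (state in the sixth field). The function name and the sample addresses are made up for illustration and are not taken from the cluster.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Tally the State column (6th field) of `netstat -pnt` output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol name (tcp or tcp6);
        # the two header lines do not, so they are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


# Hypothetical sample output; addresses and PIDs are invented.
sample = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.0.0.5:9090           10.0.0.9:51234          ESTABLISHED 1234/java
tcp        0      0 10.0.0.5:9090           10.0.0.9:51235          ESTABLISHED 1234/java
tcp        1      0 10.0.0.5:9090           10.0.0.7:40000          CLOSE_WAIT  1234/java
"""

for state, n in count_tcp_states(sample).most_common():
    print(n, state)
```

Unlike the awk pipeline, this skips the header lines, so only real connection states are counted.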

Comment 1

7 years ago
On the Socorro side, we've lowered the hbaseTimeout to 500ms.  It seems the majority of reports are timing out and going into fallback storage (NFS).

Cloudera Support and Stack are on the box and trying to debug.
We shut down the thrift servers and after a bit the cluster became responsive again.  Next immediate steps are to restart and monitor.
Just to clarify, the fallback storage to which Laura refers is not NFS.  It is local temporary storage living on the collector boxes themselves.
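To illustrate the timeout-then-fallback behaviour described in these comments, here is a minimal Python sketch. It is not the Socorro collector code: `store_with_fallback`, `primary_put`, `fallback_put`, and `slow_put` are hypothetical names, and the only value taken from this bug is the 500ms timeout.

```python
import concurrent.futures as cf
import time


def store_with_fallback(primary_put, fallback_put, payload, timeout_s=0.5):
    """Try the primary store; on timeout or error, write to the fallback.

    Returns "primary" or "fallback" so the caller can count failures.
    """
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary_put, payload)
    try:
        future.result(timeout=timeout_s)
        return "primary"
    except Exception:
        # Covers both a timeout and an error raised by the primary store.
        fallback_put(payload)
        return "fallback"
    finally:
        # Do not block on a slow primary call; note that the call itself
        # keeps running in its worker thread until it returns.
        pool.shutdown(wait=False)


def slow_put(payload):
    """Stand-in for a store that is too slow to answer in time."""
    time.sleep(0.3)


local_queue = []  # stand-in for local temporary storage
print(store_with_fallback(slow_put, local_queue.append, b"crash", timeout_s=0.05))
```

With a timeout this aggressive, a slow cluster pushes nearly every write down the fallback path, which matches what the collectors were seeing.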
We restarted and, after adjusting the collector timeouts again, things seem to be up and running smoothly.

The failure seems to be related to a failed disk on a node that happened to be hosting a critical piece of HBase metadata at the time.  The failed disk didn't cause anything to stop working (which would have kicked the node out of the cluster), but rather just slowed all disk IO on that machine, which eventually cascaded into slow operations across the cluster.
Assignee: nobody → deinspanjer
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED