Closed Bug 599947 Opened 14 years ago Closed 14 years ago

Very slow requests to Socorro HBase cluster causing client side failures

Categories

(Mozilla Metrics :: Hadoop/HBase Operations, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED
Unreviewed

People

(Reporter: dre, Assigned: dre)

Details

Requests like 
get 'crash_reports', '310092732c9d433-578c-46b4-8a5d-ea0182100927'
and 
python26 socorro/storage/hbaseClient.py -h cm-hadoop06 -t 60000 merge_scan_with_prefix crash_reports_index_legacy_unprocessed_flag '' ids:ooid 10

take so long the client times out.
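For reference, a minimal sketch of how one might time the same call from a collector box. The command line is copied verbatim from above; the 120-second ceiling is an arbitrary assumption, and this is only a measurement aid, not part of Socorro:

# Time the hbaseClient.py call quoted above; give up past an assumed ceiling.
import subprocess
import time

CMD = ["python26", "socorro/storage/hbaseClient.py",
       "-h", "cm-hadoop06", "-t", "60000",
       "merge_scan_with_prefix",
       "crash_reports_index_legacy_unprocessed_flag", "", "ids:ooid", "10"]
CEILING_SECONDS = 120  # assumption: anything slower than this counts as a failure

start = time.time()
proc = subprocess.Popen(CMD, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
while proc.poll() is None and time.time() - start < CEILING_SECONDS:
    time.sleep(1)
if proc.poll() is None:
    proc.kill()
    print "gave up after %.1f seconds" % (time.time() - start)
else:
    print "completed in %.1f seconds" % (time.time() - start)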

No known cause as of yet.

No missing regions that we've seen. When doing a get from the HBase shell, the requests succeed after about 5 to 10 minutes.

No unusual errors so far in cluster logs.

Ran
 netstat -pnt | awk '{print $6}' | sort | uniq -c

It reports that most worker nodes have about 8500 connections in ESTABLISHED and 1000 in CLOSE_WAIT.
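If the shell one-liner gets awkward to repeat across nodes, roughly the same tally can be done in Python. This is just a sketch, assuming the usual netstat -pnt column layout where the sixth field is the TCP state:

# Tally TCP connection states, equivalent in spirit to the awk pipeline above.
import subprocess
from collections import defaultdict

output = subprocess.Popen(["netstat", "-pnt"],
                          stdout=subprocess.PIPE).communicate()[0]
counts = defaultdict(int)
for line in output.splitlines():
    fields = line.split()
    # tcp lines: proto recv-q send-q local-addr foreign-addr state pid/program
    if len(fields) >= 6 and fields[0].startswith("tcp"):
        counts[fields[5]] += 1
for state, count in sorted(counts.items()):
    print "%8d %s" % (count, state)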
On the Socorro side, we've lowered the hbaseTimeout to 500ms.  It seems the majority of reports are failing out and going into fallback storage (NFS).

Cloudera Support and Stack are on the box and trying to debug.
We shut down the Thrift servers, and after a bit the cluster became responsive again.  The next immediate steps are to restart and monitor.
Just to clarify, the fallback storage to which Laura refers is not NFS.  It is local temporary storage living on the collector boxes themselves.
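For anyone following along, the behavior being described is a simple timeout-then-fallback pattern. The sketch below is only an illustration, not Socorro's actual code: save_to_hbase and save_to_local_fallback are hypothetical stand-ins for the real storage calls, and the exception type is an assumption.

# Illustration of the timeout-then-fallback behavior described above.
# save_to_hbase and save_to_local_fallback are hypothetical placeholders.
import socket

HBASE_TIMEOUT_SECONDS = 0.5  # the lowered 500ms hbaseTimeout mentioned above

def store_crash_report(ooid, dump, save_to_hbase, save_to_local_fallback):
    try:
        save_to_hbase(ooid, dump, timeout=HBASE_TIMEOUT_SECONDS)
    except socket.timeout:
        # HBase is too slow; park the report in local temporary storage on
        # the collector box so it can be replayed into HBase later.
        save_to_local_fallback(ooid, dump)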
Restarted, and after adjusting the collector timeouts again, things seem to be up and running smoothly.

The failure seems to be related to a failed disk on a node that happened to be hosting a critical piece of HBase metadata at the time.  The failed disk didn't cause anything to stop working outright (which would have kicked the node out of the cluster); rather, it just slowed all disk I/O on that machine, which eventually cascaded into slow operations across the cluster.
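As a side note, a disk that is slow but not dead usually shows up as an outsized await value in extended iostat output rather than as outright errors. A rough detection sketch, assuming a sysstat version whose iostat -dx output includes an "await" column; the 100 ms threshold is an arbitrary assumption:

# Flag block devices whose average I/O wait (await, ms) looks pathological.
import subprocess

AWAIT_THRESHOLD_MS = 100.0  # assumption: anything above this is suspicious

output = subprocess.Popen(["iostat", "-dx"],
                          stdout=subprocess.PIPE).communicate()[0]
header = None
for line in output.splitlines():
    fields = line.split()
    if fields and fields[0].rstrip(":") == "Device":
        header = fields  # column names, e.g. Device: ... await ... %util
        continue
    if header and fields and len(fields) == len(header):
        stats = dict(zip(header, fields))
        if "await" in stats and float(stats["await"]) > AWAIT_THRESHOLD_MS:
            print "suspect device %s: await=%s ms" % (fields[0], stats["await"])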
Assignee: nobody → deinspanjer
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED