HBase insert throughput is too slow (PHX)

RESOLVED FIXED in 1.7.6

Status

Socorro
General
RESOLVED FIXED
7 years ago
4 years ago

People

(Reporter: laura, Assigned: xstevens)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Reporter)

Description

7 years ago
Filing this as a place for investigations.

Smoke testing reveals throughput in PHX is too slow for production.  From rhelmer's report:
Executive summary:
Inserting to HBase is our bottleneck for receiving crashes. It is far slower than expected, and likely indicates a problem with
the setup in PHX. We do not know if this is Zeus, the network, or something else, and could use help from Metrics and IT on this.

Possible next step: test with collectors communicating to hbase/thrift without Zeus to eliminate this as possibility (if nothing else obvious jumps out).

-- 
2011-01-13 16:41:28
  started 10 node test, 24k crashes each (inserting as quickly as possible)
  collector is storing straight to hbase.
-- 
2011-01-13 17:47:11
  only 14774 crashes inserted; we stop the test because this is far too slow
  avg crashes per minute: ~20
  avg seconds to insert crash: ~10
-- 
2011-01-13 18:00:44
  at jberkus' suggestion, we try enabling localFS storage system on collectors.
  collector is storing straight to local disk.
-- 
2011-01-13 18:14:28
  test complete, all 240k crashes on 10 nodes inserted.
  avg crashes per minute: ~1700
  avg seconds to insert crash: ~0.12s
-- 
2011-01-13 19:47:56
  collectors have inserted all crashes from local disk to hbase
-- 

Ganglia has more details, but here are a couple graphs per-node (need MPT-VPN):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_inserts.png
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_insert_time.png
(Reporter)

Updated

7 years ago
Assignee: deinspanjer → xstevens
A cursory view of zeus doesn't show any errors or resource starvation.. Ditto when spot-checking yesterday's stats on a node (hp-node01).

asking for another quick test on irc so I can take a look in real time.
(Assignee)

Comment 2

7 years ago
So lars gathered some benchmarks from the python side.  Basically two types of calls 1 for atomicIncrement on metrics table the rest are Puts via Thrift.  I did some quick analysis on the log data he gave and here it is (V3 is the time it took):

> summary(atomics$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1208    3523    6008   12550   14450  346100 
> summary(mutates$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1660    6036    7410   17760   13110 3079000 

Trying to get clarification if those times are in millis or nanos?  If its millis though this is horribly slow.
for the benefit of those that didn't see my response on IRC, the numbers are in microseconds
(Assignee)

Comment 4

7 years ago
Okay so if these are in microseconds.  Where are we slowing down at?  Obviously the max is of some concern here but the 3rd quartile is looking pretty good.
Can we have more detail on those numbers?  They don't match at all what we were seeing in the full system test.  What's different?  How were they generated?
here's where those numbers came from.  The python code uses an hbase client library.  A single insert of a crash boils down to being a collection of calls into the hbase client library. There are two calls: atomicIncrement and mutateRow.  Each crash insert is about three mutateRow calls and a lot of atomicIncrements.  At xstevens request, I benchmarked those calls.  Each call logs the amount of time that it took.
(Assignee)

Comment 7

7 years ago
So the new numbers from this evenings test are:

> summary(mutates$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1617    2404    2622    8804    9916 2025000 
> summary(atomics$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1229    1761    1988    2380    2302  478800 
> summary(jsonputs$V3)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   32230    58000    64740   198100    78080 16240000
> quantile(jsonputs$V3, c(0.75, 0.80, 0.85, 0.90, 0.95, 0.99))
      75%       80%       85%       90%       95%       99% 
 78082.75  90072.00 108685.75 125745.70 162498.45 634906.36

These numbers look much much better.  Median put_json_dump time improved by 74% and the third quantile improved by 78.85%.
(Assignee)

Comment 8

7 years ago
Just so everyone knows this improvement was due to making some hadoop/hbase config changes.  I audited our configs earlier and realized we had some key things missing between SJC and PHX.  This was largely because I used research as a template for PHX because they were CDH3 configs, but research doesn't use the same parameters as production does.
I should note that xstevens posting is from the low pressure test.  I will get the high pressure test data to him soon.

we did two load tests today - one high pressure and one low pressure.  I wanted to find out how the benchmark times in change in high vs low load.  I'm also interested in the ratio between the total benchmark time of the 'put_json_dump' call is to the sum of the 'mutate' and 'atomicIncrement' parts.  

During this analysis, I discovered that HBase timeouts happen nearly precisely every 30 seconds on every thread in every test.  It doesn't matter if it is a high load or low load.  I verified that this happened in yesterdays tests, too.  The connections do not timeout simultaneously.  I don't know what this means, but is sure is interesting.

On my ratio investigation, eyeballing it suggests that the single thread test had a better ratio of benchmarked put_json to the sum of the parts.  The higher lead tests seemed to be farther off.  I'm wondering if we're seeing threads being blocked by python's threading woes.  That would account for the discrepancy.  The counter argument is that the straight collector to hbase test are also being slow.  We should retest collector direct submitting with this benchmarking to see the differences.

My next step is to massage the datafile from todays tests so that we can more precisely analyze the change in the ratio between the low and high pressure tests.

So we've got multiple vectors to investigate here.  If anyone has opinions as to why some investigations are likely dead ends or unfruitful, please speak up.
(In reply to comment #8)
> I audited our configs earlier and realized we had some key
> things missing between SJC and PHX.

@jabba - you'll make sure this gets reflected in puppet?
> During this analysis, I discovered that HBase timeouts happen nearly precisely
> every 30 seconds on every thread in every test.  It doesn't matter if it is a
> high load or low load.  I verified that this happened in yesterdays tests, too.
>  The connections do not timeout simultaneously.  I don't know what this means,
> but is sure is interesting.

I'd like to rule out networking - does someone have a doc or diagram of the network data flow and which hosts or pieces of hardware traffic touches?  

For instance, could be some session state issue in Zeus that resets every 30 seconds but I don't know how Zeus ties into this or the namespace nodes (different than sjc namespace nodes).
(Assignee)

Comment 12

7 years ago
mrz, the config issue doesn't have anything to do with puppet problems.  We push different configs for different clusters.
(In reply to comment #11)
> For instance, could be some session state issue in Zeus that resets every 30
> seconds but I don't know how Zeus ties into this or the namespace nodes
> (different than sjc namespace nodes).

In zeus we have a 10 sec timeout on new connections without data, and 300 second timeout on idle connections.

If anyone would like to test again this morning I've gone ahead and removed both for socorro-thrift1 since I will be leaving for church soon.

If this does not make a difference I would like to put them back.
(Reporter)

Comment 14

7 years ago
(In reply to comment #11)
> > During this analysis, I discovered that HBase timeouts happen nearly precisely
> > every 30 seconds on every thread in every test.  It doesn't matter if it is a
> > high load or low load.  I verified that this happened in yesterdays tests, too.
> >  The connections do not timeout simultaneously.  I don't know what this means,
> > but is sure is interesting.
> 
> I'd like to rule out networking - does someone have a doc or diagram of the
> network data flow and which hosts or pieces of hardware traffic touches?  
> 

https://wiki.mozilla.org/Socorro_Communication_Paths
(In reply to comment #13)
> In zeus we have a 10 sec timeout on new connections without data, and 300
> second timeout on idle connections.
> 
> If anyone would like to test again this morning I've gone ahead and removed
> both for socorro-thrift1 since I will be leaving for church soon.
> 
> If this does not make a difference I would like to put them back.

This did not make a difference and has been reverted.  Additional info:

11:23 < lars> the timeout is coming to me as an exception socket.timeout
11:25 < lars> it's being raised from within the hbase.client api
Created attachment 504301 [details] [diff] [review]
hbase benchmarking patch for Socorro 1.7.6
Created attachment 504302 [details] [diff] [review]
hbase benchmarking patch for Socorro 1.7.5.5
(Assignee)

Comment 18

7 years ago
Referring back to the numbers in comment #7:

At 3rd Quartile That's 768 crashes/minute per collector I believe.
Given that our highest peak of crashes per minute is 2.2k, that means that with just 3 collectors, we could reliably handle this load.  That sounds fairly promising to me.

I think at this point, a longer duration smoke test would be in order.  If we sustain that throughput then I think we could say we are safe for migration.
We've brought up additional interfaces temporarily on the 6 collectors that reside in the metrics network (vlan100) in phx.  So they have direct access now and we can test "without fw or zeus".

There is no load balancing of the thrift nodes, but if you want a simple distribution round robin I've setup hp-noderr.phx1.mozilla.com which has the first 9 in them.
(Reporter)

Comment 21

7 years ago
(In reply to comment #19)
> 
> I think at this point, a longer duration smoke test would be in order.  If we
> sustain that throughput then I think we could say we are safe for migration.

Ok.  Tell us what you want to test.
Just a long running high volume test similar to our peak rates of 2.2k per minute.
Is having Zeus (or some load balancer) a blocker to Jan 22?  Is it acceptable to roll to prod/PHX without that while we do more investigations?
With Zeus:
14774 crashes submitted out of 240k in 1 hour (aborted)
--
Average insert time per minute, per node (10 nodes):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_insert_time.png
Average crashes submitted per minutes, per node (10):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_inserts.png
--

Without Zeus:
239441 crashes submitted out of 240k in 30 minutes
(~2% error rate expected because of https://wiki.mozilla.org/Socorro/Pre-PHX_Smoketest_Schedule#known_problems)
--
Average insert time per minute, per node (10 nodes):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run16/avg_insert_time-no_zeus.png
Average crashes submitted per minutes, per node (10):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run16/avg_inserts-no_zeus.png
--

You can find the all raw data in:
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run16/
(Assignee)

Comment 25

7 years ago
Going to set this to resolved.  Obviously if we find new things which will improve performance further we'll try them out at that time.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro
You need to log in before you can comment on or make changes to this bug.