Closed Bug 625853 Opened 14 years ago Closed 13 years ago

HBase insert throughput is too slow (PHX)

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: laura, Assigned: xstevens)

References

Details

Attachments

(2 files)

Filing this as a place for investigations.

Smoke testing reveals throughput in PHX is too slow for production.  From rhelmer's report:
Executive summary:
Inserting to HBase is our bottleneck for receiving crashes. It is far slower than expected, and likely indicates a problem with
the setup in PHX. We do not know if this is Zeus, the network, or something else, and could use help from Metrics and IT on this.

Possible next step: test with collectors communicating with hbase/thrift directly, bypassing Zeus, to eliminate it as a possibility (if nothing else obvious jumps out).

-- 
2011-01-13 16:41:28
  started 10 node test, 24k crashes each (inserting as quickly as possible)
  collector is storing straight to hbase.
-- 
2011-01-13 17:47:11
  only 14774 crashes inserted; we stop the test because this is far too slow
  avg crashes per minute: ~20
  avg seconds to insert crash: ~10
-- 
2011-01-13 18:00:44
  at jberkus' suggestion, we try enabling localFS storage system on collectors.
  collector is storing straight to local disk.
-- 
2011-01-13 18:14:28
  test complete, all 240k crashes on 10 nodes inserted.
  avg crashes per minute: ~1700
  avg seconds to insert crash: ~0.12s
-- 
2011-01-13 19:47:56
  collectors have inserted all crashes from local disk to hbase
-- 

Ganglia has more details, but here are a couple of per-node graphs (MPT VPN required):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_inserts.png
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_insert_time.png
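
For reference, a quick back-of-the-envelope check of the throughput figures above, with durations approximated from the timestamps in the log (a rough sketch, not part of the original report):

# Rough throughput check for the two runs above; durations are estimates.
def rate_per_node(crashes, minutes, nodes=10):
    # Average crashes inserted per minute, per collector node.
    return crashes / float(minutes) / nodes

# Run 1: collectors writing straight to HBase, aborted after ~66 minutes.
print(rate_per_node(14774, 66))    # ~22 crashes/min/node, i.e. the "~20" above

# Run 2: collectors writing to local disk, ~14 minutes for all 240k crashes.
print(rate_per_node(240000, 14))   # ~1700 crashes/min/node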
Assignee: deinspanjer → xstevens
A cursory view of Zeus doesn't show any errors or resource starvation. The same goes for spot-checking yesterday's stats on a node (hp-node01).

Asking for another quick test on IRC so I can take a look in real time.
So lars gathered some benchmarks from the Python side. There are basically two types of calls: one is atomicIncrement on the metrics table, and the rest are Puts via Thrift. I did some quick analysis on the log data he gave me, and here it is (V3 is the time each call took):

> summary(atomics$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1208    3523    6008   12550   14450  346100 
> summary(mutates$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1660    6036    7410   17760   13110 3079000 

Trying to get clarification on whether those times are in millis or nanos. If it's millis, though, this is horribly slow.
For the benefit of those who didn't see my response on IRC: the numbers are in microseconds.
Okay, so if these are in microseconds, where are we slowing down? Obviously the max is of some concern here, but the 3rd quartile is looking pretty good.
Can we have more detail on those numbers? They don't match what we were seeing in the full system test at all. What's different? How were they generated?
Here's where those numbers came from. The Python code uses an HBase client library, and a single crash insert boils down to a collection of calls into that library. There are two kinds of calls: atomicIncrement and mutateRow. Each crash insert is roughly three mutateRow calls and a lot of atomicIncrements. At xstevens' request, I benchmarked those calls; each call logs the amount of time it took.
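
For anyone curious how such per-call timing can be captured, here is a minimal sketch; the wrapper and logger names are hypothetical, not the actual Socorro code:

import logging
import time

benchmark_log = logging.getLogger("hbase_benchmark")  # hypothetical logger name

def timed_call(name, fn, *args, **kwargs):
    # Call fn and log how long it took, in microseconds.
    start = time.time()
    try:
        return fn(*args, **kwargs)
    finally:
        elapsed_us = int((time.time() - start) * 1000000)
        benchmark_log.info("%s %d", name, elapsed_us)

# Usage (client is an HBase Thrift client; argument names are illustrative):
#   timed_call("mutateRow", client.mutateRow, table, row, mutations)
#   timed_call("atomicIncrement", client.atomicIncrement, table, row, column, 1)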
So the new numbers from this evening's test are:

> summary(mutates$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1617    2404    2622    8804    9916 2025000 
> summary(atomics$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1229    1761    1988    2380    2302  478800 
> summary(jsonputs$V3)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   32230    58000    64740   198100    78080 16240000
> quantile(jsonputs$V3, c(0.75, 0.80, 0.85, 0.90, 0.95, 0.99))
      75%       80%       85%       90%       95%       99% 
 78082.75  90072.00 108685.75 125745.70 162498.45 634906.36

These numbers look much, much better. The median put_json_dump time improved by 74% and the third quartile improved by 78.85%.
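
As a worked example of the relative-improvement arithmetic, here it is applied to the mutateRow medians from the two benchmark runs above (the earlier put_json_dump figures behind the 74% number aren't shown in this excerpt, so mutateRow serves as the illustration):

def improvement(old, new):
    # Relative improvement, as a percentage.
    return (old - new) / float(old) * 100.0

# mutateRow medians from the earlier and later benchmark runs (microseconds)
print(improvement(7410, 2622))   # ~64.6% faster at the median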
Just so everyone knows, this improvement was due to making some Hadoop/HBase config changes. I audited our configs earlier and realized we were missing some key settings in PHX compared to SJC. This was largely because I used the research cluster's configs as a template for PHX, since those were CDH3 configs, but research doesn't use the same parameters as production does.
I should note that xstevens' posting is from the low-pressure test. I will get the high-pressure test data to him soon.

We did two load tests today, one high pressure and one low pressure. I wanted to find out how the benchmark times change under high vs. low load. I'm also interested in the ratio of the total benchmark time of the 'put_json_dump' call to the sum of its 'mutate' and 'atomicIncrement' parts.

During this analysis, I discovered that HBase timeouts happen nearly precisely every 30 seconds on every thread in every test. It doesn't matter if it is a high load or low load. I verified that this happened in yesterday's tests, too. The connections do not time out simultaneously. I don't know what this means, but it sure is interesting.

On my ratio investigation, eyeballing it suggests that the single-thread test had a better ratio of benchmarked put_json time to the sum of its parts; the higher-load tests seemed to be farther off. I'm wondering if we're seeing threads being blocked by Python's threading woes, which would account for the discrepancy. The counterargument is that the straight collector-to-HBase tests are also slow. We should retest direct collector submission with this benchmarking in place to see the differences.

My next step is to massage the data file from today's tests so that we can more precisely analyze the change in the ratio between the low- and high-pressure tests.

So we've got multiple vectors to investigate here.  If anyone has opinions as to why some investigations are likely dead ends or unfruitful, please speak up.
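
A rough sketch of the put_json_dump-to-parts ratio analysis described above, assuming a benchmark log with one "call-name elapsed-microseconds" pair per line (the log format here is an assumption, not the actual file layout):

from collections import defaultdict

def put_json_ratio(path):
    # Total put_json_dump time vs. the sum of its mutate and atomicIncrement
    # parts, from a simple "name elapsed_us" log.
    totals = defaultdict(int)
    with open(path) as f:
        for line in f:
            name, elapsed_us = line.split()
            totals[name] += int(elapsed_us)
    parts = totals["mutateRow"] + totals["atomicIncrement"]
    return totals["put_json_dump"] / float(parts)

# A ratio well above 1.0 means most of the wall-clock time is spent outside
# the individual HBase calls (e.g. blocked threads or connection handling).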
(In reply to comment #8)
> I audited our configs earlier and realized we were missing some key settings
> in PHX compared to SJC.

@jabba - you'll make sure this gets reflected in puppet?
> During this analysis, I discovered that HBase timeouts happen nearly precisely
> every 30 seconds on every thread in every test. It doesn't matter if it is a
> high load or low load. I verified that this happened in yesterday's tests,
> too. The connections do not time out simultaneously. I don't know what this
> means, but it sure is interesting.

I'd like to rule out networking - does someone have a doc or diagram of the network data flow and which hosts or pieces of hardware traffic touches?  

For instance, it could be some session-state issue in Zeus that resets every 30 seconds, but I don't know how Zeus ties into this or into the namespace nodes (which are different from the SJC namespace nodes).
mrz, the config issue doesn't have anything to do with puppet problems.  We push different configs for different clusters.
(In reply to comment #11)
> For instance, it could be some session-state issue in Zeus that resets every
> 30 seconds, but I don't know how Zeus ties into this or into the namespace
> nodes (which are different from the SJC namespace nodes).

In zeus we have a 10 sec timeout on new connections without data, and 300 second timeout on idle connections.

If anyone would like to test again this morning I've gone ahead and removed both for socorro-thrift1 since I will be leaving for church soon.

If this does not make a difference I would like to put them back.
(In reply to comment #11)
> > During this analysis, I discovered that HBase timeouts happen nearly
> > precisely every 30 seconds on every thread in every test. It doesn't matter
> > if it is a high load or low load. I verified that this happened in
> > yesterday's tests, too. The connections do not time out simultaneously. I
> > don't know what this means, but it sure is interesting.
> 
> I'd like to rule out networking - does someone have a doc or diagram of the
> network data flow and which hosts or pieces of hardware traffic touches?  
> 

https://wiki.mozilla.org/Socorro_Communication_Paths
(In reply to comment #13)
> In zeus we have a 10 sec timeout on new connections without data, and 300
> second timeout on idle connections.
> 
> If anyone would like to test again this morning I've gone ahead and removed
> both for socorro-thrift1 since I will be leaving for church soon.
> 
> If this does not make a difference I would like to put them back.

This did not make a difference and has been reverted.  Additional info:

11:23 < lars> the timeout is coming to me as an exception socket.timeout
11:25 < lars> it's being raised from within the hbase.client api
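
For context on where a socket.timeout like that can come from: with the Thrift Python bindings, the client-side timeout is set on the transport, roughly as in the sketch below (the host, port, and the 30-second value are assumptions for illustration; the Socorro client library may configure this differently):

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

sock = TSocket.TSocket("hp-noderr.phx1.mozilla.com", 9090)  # standard HBase Thrift port
sock.setTimeout(30 * 1000)  # milliseconds; a 30s value would match the observed interval
transport = TTransport.TBufferedTransport(sock)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
transport.open()
# Hbase.Client(protocol) calls go here; when a read or write exceeds the socket
# timeout, socket.timeout propagates up from inside the client library.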
Referring back to the numbers in comment #7:

At the 3rd quartile, that's 768 crashes/minute per collector, I believe.
Given that our highest peak is 2.2k crashes per minute, that means that with just 3 collectors we could reliably handle this load. That sounds fairly promising to me.

I think at this point a longer-duration smoke test would be in order. If we sustain that throughput, then I think we could say we are safe for migration.
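
The 768/minute figure follows directly from the jsonputs 3rd quartile posted above (a quick check, assuming each collector inserts one crash at a time):

q3_us = 78082.75               # jsonputs 3rd quartile, in microseconds
per_crash_s = q3_us / 1e6      # ~0.078 seconds per crash
print(60 / per_crash_s)        # ~768 crashes per minute per collector
print(3 * 60 / per_crash_s)    # ~2300/min with 3 collectors, above the 2.2k peak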
We've brought up additional interfaces temporarily on the 6 collectors that reside in the metrics network (vlan100) in PHX, so they now have direct access and we can test "without fw or zeus".

There is no load balancing of the thrift nodes, but if you want a simple round-robin distribution, I've set up hp-noderr.phx1.mozilla.com, which has the first 9 nodes in it.
(In reply to comment #19)
> 
> I think at this point a longer-duration smoke test would be in order. If we
> sustain that throughput, then I think we could say we are safe for migration.

Ok.  Tell us what you want to test.
Just a long-running, high-volume test similar to our peak rate of 2.2k per minute.
Is having Zeus (or some load balancer) a blocker to Jan 22?  Is it acceptable to roll to prod/PHX without that while we do more investigations?
With Zeus:
14774 crashes submitted out of 240k in 1 hour (aborted)
--
Average insert time per minute, per node (10 nodes):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_insert_time.png
Average crashes submitted per minute, per node (10 nodes):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_inserts.png
--

Without Zeus:
239441 crashes submitted out of 240k in 30 minutes
(~2% error rate expected because of https://wiki.mozilla.org/Socorro/Pre-PHX_Smoketest_Schedule#known_problems)
--
Average insert time per minute, per node (10 nodes):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run16/avg_insert_time-no_zeus.png
Average crashes submitted per minute, per node (10 nodes):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run16/avg_inserts-no_zeus.png
--

You can find all the raw data in:
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run16/
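
A rough comparison of aggregate throughput for the two runs, with durations approximated from the descriptions above:

with_zeus = 14774 / 60.0         # aborted after ~1 hour: ~250 crashes/min total
without_zeus = 239441 / 30.0     # ~8,000 crashes/min total
print(without_zeus / with_zeus)  # roughly a 30x difference in aggregate throughput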
Going to set this to resolved. Obviously, if we find new things that will improve performance further, we'll try them out at that time.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Socorro → General
Product: Webtools → Socorro