Closed
Bug 625853
Opened 14 years ago
Closed 13 years ago
HBase insert throughput is too slow (PHX)
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
1.7.6
People
(Reporter: laura, Assigned: xstevens)
References
Details
Attachments
(2 files)
patch, 3.41 KB
patch, 3.69 KB
Filing this as a place for investigations. Smoke testing reveals throughput in PHX is too slow for production.

From rhelmer's report:

Executive summary: Inserting to HBase is our bottleneck for receiving crashes. It is far slower than expected, and likely indicates a problem with the setup in PHX. We do not know if this is Zeus, the network, or something else, and could use help from Metrics and IT on this. Possible next step: test with collectors communicating to hbase/thrift without Zeus to eliminate this as a possibility (if nothing else obvious jumps out).

--
2011-01-13 16:41:28 started 10 node test, 24k crashes each (inserting as quickly as possible); collector is storing straight to hbase.
--
2011-01-13 17:47:11 only 14774 crashes inserted; we stop the test because this is far too slow.
avg crashes per minute: ~20
avg seconds to insert crash: ~10
--
2011-01-13 18:00:44 at jberkus' suggestion, we try enabling the localFS storage system on collectors; collector is storing straight to local disk.
--
2011-01-13 18:14:28 test complete, all 240k crashes on 10 nodes inserted.
avg crashes per minute: ~1700
avg seconds to insert crash: ~0.12s
--
2011-01-13 19:47:56 collectors have inserted all crashes from local disk to hbase
--

Ganglia has more details, but here are a couple of graphs per-node (need MPT-VPN):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_inserts.png
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_insert_time.png
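As a sanity check on the localFS figures in the log above, the per-node throughput can be recomputed from the two timestamps (a quick sketch; it only uses numbers already stated above):

```python
from datetime import datetime

# Timestamps taken from the smoke-test log above.
start = datetime(2011, 1, 13, 18, 0, 44)   # localFS test started
end = datetime(2011, 1, 13, 18, 14, 28)    # all 240k crashes inserted

elapsed_min = (end - start).total_seconds() / 60.0
total_crashes = 240_000
nodes = 10

per_node_per_min = total_crashes / elapsed_min / nodes
print(round(per_node_per_min))  # → 1748, consistent with the reported ~1700
```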
Reporter
Updated•14 years ago
Assignee: deinspanjer → xstevens
Comment 1•14 years ago
A cursory view of Zeus doesn't show any errors or resource starvation. Ditto when spot-checking yesterday's stats on a node (hp-node01). Asking for another quick test on IRC so I can take a look in real time.
Assignee
Comment 2•14 years ago
So lars gathered some benchmarks from the python side. Basically there are two types of calls: one atomicIncrement on the metrics table; the rest are Puts via Thrift. I did some quick analysis on the log data he gave me (V3 is the time each call took):

> summary(atomics$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1208    3523    6008   12550   14450  346100
> summary(mutates$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1660    6036    7410   17760   13110 3079000

Trying to get clarification on whether those times are in millis or nanos. If it's millis, though, this is horribly slow.
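The five-number summaries above come from R's summary(). For anyone working from the raw log in Python instead, an equivalent check could look like this (the `times` values here are hypothetical stand-ins for the V3 column):

```python
import statistics

# Hypothetical sample of per-call durations (the V3 column from the log).
times = [1208, 3523, 6008, 14450, 346100]

def summary(xs):
    """Approximate R's summary(): min, quartiles, mean, max."""
    xs = sorted(xs)
    # method="inclusive" matches R's default quantile type 7.
    q = statistics.quantiles(xs, n=4, method="inclusive")
    return {
        "Min": xs[0], "1st Qu.": q[0], "Median": q[1],
        "Mean": statistics.fmean(xs), "3rd Qu.": q[2], "Max": xs[-1],
    }

print(summary(times))
```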
Comment 3•14 years ago
For the benefit of those that didn't see my response on IRC: the numbers are in microseconds.
Assignee
Comment 4•14 years ago
Okay, so if these are in microseconds, where are we slowing down? Obviously the max is of some concern here, but the 3rd quartile is looking pretty good.
Comment 5•14 years ago
Can we have more detail on those numbers? They don't match what we were seeing in the full system test at all. What's different? How were they generated?
Comment 6•14 years ago
Here's where those numbers came from. The python code uses an hbase client library. A single insert of a crash boils down to a collection of calls into that library. There are two calls: atomicIncrement and mutateRow. Each crash insert is about three mutateRow calls and a lot of atomicIncrements. At xstevens' request, I benchmarked those calls; each call logs the amount of time it took.
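A minimal sketch of how that per-call benchmarking could be wired in. The function names mirror the Thrift calls named above, but the bodies and the logging format are hypothetical; the actual Socorro hbase client code differs:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("hbase_bench")

def timed(fn):
    """Log how long each wrapped client call takes, in microseconds."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_us = (time.perf_counter() - start) * 1_000_000
            log.info("%s %.0f", fn.__name__, elapsed_us)
    return wrapper

# Hypothetical stand-ins for the Thrift client calls being benchmarked:
@timed
def mutateRow(table, row, mutations):
    pass  # the real client issues a Thrift mutateRow call here

@timed
def atomicIncrement(table, row, column, amount=1):
    pass  # the real client issues a Thrift atomicIncrement call here
```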
Assignee
Comment 7•14 years ago
So the new numbers from this evening's test are:

> summary(mutates$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1617    2404    2622    8804    9916 2025000
> summary(atomics$V3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1229    1761    1988    2380    2302  478800
> summary(jsonputs$V3)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
   32230    58000    64740   198100    78080 16240000
> quantile(jsonputs$V3, c(0.75, 0.80, 0.85, 0.90, 0.95, 0.99))
      75%       80%       85%       90%       95%       99%
 78082.75  90072.00 108685.75 125745.70 162498.45 634906.36

These numbers look much, much better. The median put_json_dump time improved by 74% and the third quartile improved by 78.85%.
Assignee
Comment 8•14 years ago
Just so everyone knows, this improvement was due to making some hadoop/hbase config changes. I audited our configs earlier and realized we had some key things missing between SJC and PHX. This happened largely because I used research as a template for PHX, since those were CDH3 configs, but research doesn't use the same parameters as production does.
Comment 9•14 years ago
I should note that xstevens' posting is from the low pressure test; I will get the high pressure test data to him soon.

We did two load tests today - one high pressure and one low pressure. I wanted to find out how the benchmark times change under high vs. low load. I'm also interested in the ratio of the total benchmark time of the 'put_json_dump' call to the sum of its 'mutate' and 'atomicIncrement' parts.

During this analysis, I discovered that HBase timeouts happen nearly precisely every 30 seconds on every thread in every test. It doesn't matter if it is a high load or low load test. I verified that this happened in yesterday's tests, too. The connections do not time out simultaneously. I don't know what this means, but it sure is interesting.

On my ratio investigation, eyeballing it suggests that the single thread test had a better ratio of benchmarked put_json time to the sum of its parts. The higher load tests seemed to be farther off. I'm wondering if we're seeing threads being blocked by python's threading woes; that would account for the discrepancy. The counter argument is that the straight collector-to-hbase tests were also slow. We should retest collectors submitting directly with this benchmarking enabled to see the differences.

My next step is to massage the data file from today's tests so that we can more precisely analyze the change in the ratio between the low and high pressure tests. So we've got multiple vectors to investigate here. If anyone has opinions as to why some investigations are likely dead ends or unfruitful, please speak up.
Comment 10•14 years ago
(In reply to comment #8)
> I audited our configs earlier and realized we had some key
> things missing between SJC and PHX.

@jabba - you'll make sure this gets reflected in puppet?
Comment 11•14 years ago
> During this analysis, I discovered that HBase timeouts happen nearly precisely
> every 30 seconds on every thread in every test. It doesn't matter if it is a
> high load or low load. I verified that this happened in yesterdays tests, too.
> The connections do not timeout simultaneously. I don't know what this means,
> but is sure is interesting.
I'd like to rule out networking - does someone have a doc or diagram of the network data flow and which hosts or pieces of hardware traffic touches?
For instance, it could be some session state issue in Zeus that resets every 30 seconds, but I don't know how Zeus ties into this or into the namespace nodes (different from the sjc namespace nodes).
Assignee
Comment 12•14 years ago
mrz, the config issue doesn't have anything to do with puppet problems. We push different configs for different clusters.
Comment 13•14 years ago
(In reply to comment #11)
> For instance, could be some session state issue in Zeus that resets every 30
> seconds but I don't know how Zeus ties into this or the namespace nodes
> (different than sjc namespace nodes).

In Zeus we have a 10 second timeout on new connections without data, and a 300 second timeout on idle connections. If anyone would like to test again this morning, I've gone ahead and removed both for socorro-thrift1, since I will be leaving for church soon. If this does not make a difference I would like to put them back.
Reporter
Comment 14•14 years ago
(In reply to comment #11)
> > During this analysis, I discovered that HBase timeouts happen nearly precisely
> > every 30 seconds on every thread in every test. It doesn't matter if it is a
> > high load or low load. I verified that this happened in yesterdays tests, too.
> > The connections do not timeout simultaneously. I don't know what this means,
> > but is sure is interesting.
>
> I'd like to rule out networking - does someone have a doc or diagram of the
> network data flow and which hosts or pieces of hardware traffic touches?

https://wiki.mozilla.org/Socorro_Communication_Paths
Comment 15•14 years ago
(In reply to comment #13)
> In zeus we have a 10 sec timeout on new connections without data, and 300
> second timeout on idle connections.
>
> If anyone would like to test again this morning I've gone ahead and removed
> both for socorro-thrift1 since I will be leaving for church soon.
>
> If this does not make a difference I would like to put them back.

This did not make a difference and has been reverted.

Additional info:
11:23 < lars> the timeout is coming to me as an exception socket.timeout
11:25 < lars> it's being raised from within the hbase.client api
Comment 16•14 years ago
Comment 17•14 years ago
Assignee
Comment 18•14 years ago
Referring back to the numbers in comment #7: at the 3rd quartile, that's 768 crashes per minute per collector, I believe.
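That figure follows directly from the 3rd-quartile put_json_dump time in comment #7 (78082.75 microseconds per crash):

```python
# 3rd-quartile put_json_dump time from comment #7, in microseconds.
q3_us = 78082.75

# One crash every ~0.078 s means this many crashes per minute per collector.
crashes_per_minute = 60 / (q3_us / 1_000_000)
print(round(crashes_per_minute))  # → 768
```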
Comment 19•14 years ago
Given that our highest peak of crashes per minute is 2.2k, that means that with just 3 collectors, we could reliably handle this load. That sounds fairly promising to me. I think at this point, a longer duration smoke test would be in order. If we sustain that throughput then I think we could say we are safe for migration.
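The three-collector estimate is just the peak rate divided by the per-collector rate from comment #18, rounded up:

```python
import math

peak_per_minute = 2200  # highest observed crashes/minute (comment 19)
per_collector = 768     # crashes/minute per collector (comment 18)

collectors_needed = math.ceil(peak_per_minute / per_collector)
print(collectors_needed)  # → 3
```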
Comment 20•14 years ago
We've brought up additional interfaces temporarily on the 6 collectors that reside in the metrics network (vlan100) in phx, so they have direct access now and we can test "without fw or zeus". There is no load balancing of the thrift nodes, but if you want a simple round-robin distribution I've set up hp-noderr.phx1.mozilla.com, which has the first 9 nodes in it.
Reporter
Comment 21•14 years ago
(In reply to comment #19)
> I think at this point, a longer duration smoke test would be in order. If we
> sustain that throughput then I think we could say we are safe for migration.

Ok. Tell us what you want to test.
Comment 22•14 years ago
Just a long-running, high-volume test similar to our peak rate of 2.2k per minute.
Comment 23•14 years ago
Is having Zeus (or some load balancer) a blocker to Jan 22? Is it acceptable to roll to prod/PHX without that while we do more investigations?
Comment 24•14 years ago
With Zeus: 14774 crashes submitted out of 240k in 1 hour (aborted)
--
Average insert time per minute, per node (10 nodes):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_insert_time.png
Average crashes submitted per minute, per node (10):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/avg_inserts.png
--

Without Zeus: 239441 crashes submitted out of 240k in 30 minutes (~2% error rate expected because of https://wiki.mozilla.org/Socorro/Pre-PHX_Smoketest_Schedule#known_problems)
--
Average insert time per minute, per node (10 nodes):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run16/avg_insert_time-no_zeus.png
Average crashes submitted per minute, per node (10):
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run16/avg_inserts-no_zeus.png
--

You can find all the raw data in:
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run1/
http://khan.mozilla.org/~rhelmer/socorro_2011_phx_smoketest/run16/
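A rough back-of-the-envelope comparison of the two runs (the with-Zeus run was aborted, so its rate is taken over the roughly one hour it ran):

```python
# Aggregate throughput of the two smoke-test runs described above.
with_zeus = 14774 / 60        # crashes/minute (run aborted after ~1 hour)
without_zeus = 239441 / 30    # crashes/minute (completed in 30 minutes)

print(round(with_zeus))                 # → 246 crashes/minute
print(round(without_zeus))              # → 7981 crashes/minute
print(round(without_zeus / with_zeus))  # → 32, i.e. roughly a 32x difference
```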
Assignee
Comment 25•13 years ago
Going to set this to resolved. Obviously, if we find new things that will improve performance further, we'll try them out at that time.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•13 years ago
Component: Socorro → General
Product: Webtools → Socorro