Closed
Bug 1139897
Opened 10 years ago
Closed 10 years ago
Load test pipeline Heka
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: sphilp, Assigned: sphilp)
Details
Running many requests (curls) against a local datapipeline instance causes it to crash with a malloc error. Anecdotally it seems to handle 20-30 requests okay but above that it starts to choke.
Steps:
- Start a datapipeline instance on your local machine with the basic local pipeline config:
$ build/heka/build/heka/bin/hekad -config examples/basic_local_pipeline.toml
- Note the address if you have changed it; the default is localhost:8080
- Fire off 20 requests to make sure everything is fine:
for f in $(seq 1 20); do
curl -X POST "http://localhost:8080/submit/test/$f/foo/bar/baz" -d "{\"test\":$f}"
done
- Increase the requests to 100 or more:
for f in $(seq 1 200); do
...
- The datapipeline server should crash; your curl command should show output like:
curl: (7) Failed to connect to localhost port 8080: Connection refused
and the server should show a stacktrace like:
http://pastebin.com/KmCFMHbr
I tried restarting the server and running again, it happens every time.
I also tried it on pipeline AWS dev:
https://pipeline-prototype-incoming.dev.mozaws.net/submit/NAMESPACE/$f/foo/bar/baz
It doesn't seem to happen there, but I noticed that the requests per second are much lower there than locally. I will do a proper load test on the AWS instance, with several hosts hitting it at once to increase the rps, and post to this bug if I can reproduce it there.
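The results posted later in this bug look like output from an ab-style load-testing tool, though the exact tool isn't stated. A hypothetical invocation along these lines could drive the concurrent POSTs (the payload file, request counts, and endpoint path are placeholders, not the actual values used):
echo '{"test":1}' > payload.json
ab -n 5000 -c 63 -p payload.json -T "application/json" \
  "https://pipeline-prototype-incoming.dev.mozaws.net/submit/NAMESPACE/1/foo/bar/baz"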
Comment 1•10 years ago
I cannot reproduce this on OSX.
I am running Go 1.3.3; it appears you are running 1.4.2. Could this be the problem?
Comment 2•10 years ago (Assignee)
Go 1.3.3 appears to have fixed it; it's working at least up to 2000 requests at ~100 rps. I'll move forward with the heavier load test.
Updated•10 years ago
Assignee: nobody → sphilp
Priority: -- → P1
Summary: DataPipeline instance crashes locally when handling more than a few requests simultaneously → Load test pipeline Heka
Comment 3•10 years ago (Assignee)
The load test is waiting for a new staging env to be set up, since bug 1139614 is re-purposing the dev env. Staging is delayed due to an AWS issue that should hopefully be resolved next week.
Comment 4•10 years ago
:sphilp, are you unblocked on this now?
Comment 5•10 years ago (Assignee)
:mreid yup, starting testing today
Comment 6•10 years ago (Assignee)
Some early results:
Very light test (make sure it works)
Concurrency Level: 4
Complete requests: 100
Failed requests: 0
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 14.10 [#/sec] (mean)
Time per request: 280.000 [ms] (mean)
50% response time: 280 [ms] (mean)
75% response time: 290 [ms] (mean)
90% response time: 305 [ms] (mean)
95% response time: 307 [ms] (mean)
99% response time: 312 [ms] (mean)
Looks fine...
Moderate test
Concurrency Level: 63
Complete requests: 4992
Failed requests: 0
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 223.35 [#/sec] (mean)
Time per request: 280.000 [ms] (mean)
50% response time: 280 [ms] (mean)
75% response time: 300 [ms] (mean)
90% response time: 310 [ms] (mean)
95% response time: 310 [ms] (mean)
99% response time: 320 [ms] (mean)
Still good.
Heavy test
Concurrency Level: 302
Complete requests: 97344
Failed requests: 2496
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 533.86 [#/sec] (mean)
Time per request: 567.500 [ms] (mean)
50% response time: 500 [ms] (mean)
75% response time: 527 [ms] (mean)
90% response time: 552 [ms] (mean)
95% response time: 642 [ms] (mean)
99% response time: 5307 [ms] (mean)
Notice that the average response time climbs to 500 ms from the ~300 ms baseline, and the 99th percentile is well over 5 seconds (effectively not returned in a timely manner). There were 2000+ failed requests.
I will follow up after more investigation.
Comment 7•10 years ago
What size request payloads were used? Did they approximate the size of the telemetry messages?
Comment 8•10 years ago (Assignee)
:trink, they were much smaller than typical telemetry messages. It was just a small amount of test data (this first run was just to see how far I could push the server).
You can see it on s3 at:
net-mozaws-stage-us-west-2-pipeline-data/telemetry/20150313/telemetry/UNKNOWN/saved_session/Firefox/nightly/38.0a1
Comment 9•10 years ago (Assignee)
Talked with
Comment 10•10 years ago
I suspect that we're hitting issues when the number of concurrent HTTP requests consistently exceeds the number of available packs, since that means some requests can't be completed until packs make it all the way through the pipeline to be recycled. I recommended multiplying both poolsize and plugin_chansize settings by 4 to see if that impacts latency and failed request issues at concurrency 300.
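For reference, a minimal sketch of what that change might look like in the hekad global config section (assuming the documented defaults of poolsize = 100 and plugin_chansize = 30; the exact values and file are illustrative, not the actual staging config):
[hekad]
# 4x the default of 100: maximum number of message packs in the pool
poolsize = 400
# 4x the default of 30: input channel buffer size for each plugin
plugin_chansize = 120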
Comment 11•10 years ago (Assignee)
... Talked with :rmiller and upped the poolsize to 400 and plugin_chansize to 120 for hekad in the staging edge config; the errors went away at 500 concurrency. Ramped the load test up further to 1500 concurrent:
Concurrency Level: 1348
Complete requests: 298976
Failed requests: 112
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 1014.25 [#/sec] (mean)
Time per request: 1330.000 [ms] (mean)
50% response time: 601 [ms] (mean)
75% response time: 2555 [ms] (mean)
90% response time: 3240 [ms] (mean)
95% response time: 3340 [ms] (mean)
99% response time: 3605 [ms] (mean)
The server started to choke at this point; /var/log/messages shows:
Mar 18 20:21:58 ip-172-31-47-65 collectd[9020]: write_http plugin: curl_easy_perform failed with status 28: SSL connection timeout
Mar 18 20:22:44 ip-172-31-47-65 collectd[9020]: write_http plugin: curl_easy_perform failed with status 28: SSL connection timeout
Mar 18 20:23:34 ip-172-31-47-65 systemd[1]: hekad-edge.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 18 20:23:34 ip-172-31-47-65 systemd[1]: Unit hekad-edge.service entered failed state.
Mar 18 20:23:34 ip-172-31-47-65 systemd[1]: hekad-edge.service holdoff time over, scheduling restart.
Mar 18 20:23:34 ip-172-31-47-65 systemd[1]: Stopping hekad-edge...
Mar 18 20:23:34 ip-172-31-47-65 systemd[1]: Starting hekad-edge...
It recovered itself, so that's nice :)
Ran the load test again; it still fails, but this time complaining about memory (only 152 MB free at this point):
Mar 18 20:33:26 ip-172-31-47-65 dhclient[2255]: DHCPREQUEST on eth0 to 172.31.32.1 port 67 (xid=0x4ee39782)
Mar 18 20:33:29 ip-172-31-47-65 dhclient[2255]: DHCPACK from 172.31.32.1 (xid=0x4ee39782)
Mar 18 20:33:32 ip-172-31-47-65 dhclient[2255]: fork: Cannot allocate memory
Mar 18 20:33:33 ip-172-31-47-65 dhclient[2255]: bound to 172.31.47.65 -- renewal in 1393 seconds.
Mar 18 20:33:48 ip-172-31-47-65 collectd[9020]: write_http plugin: curl_easy_perform failed with status 28: Operation timed out after 18788 milliseconds with 0 out of 0 bytes received
Comment 12•10 years ago (Assignee)
I terminated the old edge servers since they were locking up, and played with the poolsize to find the appropriate concurrency per edge server. It looks like we can handle about 400 concurrent requests per c3.large. We're running 3 servers on staging, so that gets us a bit over 1000 req/second. 1000 rps is about 86 million requests per day, which sounds like more than we are expecting, as telemetry is going to be opt-in.
Everything seems good at this load. If we need to crank it up, though, there will need to be more testing, since things start to flake out beyond this level.
The main thing to take away here is that the number of concurrent requests divided by the number of Heka instances (edge servers) determines the required poolsize. If the number of concurrent requests per server exceeds the poolsize, Heka will choke. This could be handled more gracefully in code, but for now we need to make sure the poolsize is set high enough, spin up more instances, and/or use larger instances.
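As a rough worked example of that sizing rule (numbers illustrative, based on the figures above):
required poolsize per server >= total concurrent requests / number of edge servers
e.g. 1200 concurrent requests / 3 edge servers = 400 per server,
which matches the ~400 concurrent requests per c3.large observed above.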
Updated•10 years ago (Assignee)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 13•10 years ago (Assignee)
To clarify that a bit: if the number of CONCURRENT requests PER SERVER exceeds the poolsize set on that server, Heka will choke.
Updated•7 years ago
Product: Cloud Services → Cloud Services Graveyard