Closed Bug 1139897 Opened 10 years ago Closed 10 years ago

Load test pipeline Heka

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sphilp, Assigned: sphilp)

Details

Running many requests (curls) against a local datapipeline instance causes it to crash with a malloc error. Anecdotally it seems to handle 20-30 requests okay, but above that it starts to choke.

Steps:
- Start a datapipeline instance on your local machine with the basic local config:
  $ build/heka/build/heka/bin/hekad -config examples/basic_local_pipeline.toml
- Note the IP if you have changed it; the default is localhost:8080.
- Fire off 20 requests to make sure everything is fine:
  for f in $(seq 1 20); do curl -X POST "http://localhost:8080/submit/test/$f/foo/bar/baz" -d "{\"test\":$f}"; done
- Increase the requests to 100 or more (a parallelized variant is sketched below):
  for f in $(seq 1 200); do ...
- The datapipeline server should crash. Your curl command should show output like:
  curl: (7) Failed to connect to localhost port 8080: Connection refused
  and the server should show a stacktrace like: http://pastebin.com/KmCFMHbr

I tried restarting the server and running again; it happens every time.

I also tried it on pipeline AWS dev:
https://pipeline-prototype-incoming.dev.mozaws.net/submit/NAMESPACE/$f/foo/bar/baz
It doesn't seem to happen there, but I noticed that the requests per second is much lower there than locally. I will do a proper load test on the AWS instance and post to this bug if I can reproduce it there with several hosts hitting it at once (to increase the rps).
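For reference, a parallelized version of the loop above. This is a sketch only: xargs -P is not part of the original steps, it is just one way to keep several requests in flight against the same localhost:8080 endpoint and test payload described above.

  # Fire 200 POSTs at the local hekad HTTP input, up to 20 in flight at a time.
  # Endpoint and payload mirror the repro steps above; adjust as needed.
  seq 1 200 | xargs -P 20 -I{} \
    curl -s -o /dev/null -X POST \
    "http://localhost:8080/submit/test/{}/foo/bar/baz" \
    -d "{\"test\":{}}"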
I cannot reproduce this on OSX. I am running Go v1.3.3; it appears you are running 1.4.2. Could that be the problem?
Go v1.3.3 appears to have fixed it; it's working at least up to 2000 requests / 100 rps. I'll move forward with the heavier load test.
Assignee: nobody → sphilp
Priority: -- → P1
Summary: DataPipeline instance crashes locally when handling more than a few requests simultaneously → Load test pipeline Heka
The load test is waiting for a new staging env to be set up, since bug 1139614 is re-purposing the dev env. Staging is delayed due to an AWS issue that should hopefully be resolved next week.
:sphilp, are you unblocked on this now?
:mreid yup, starting testing today
Some early results:

Very light test (make sure it works)
Concurrency Level: 4
Complete requests: 100
Failed requests: 0
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 14.10 [#/sec] (mean)
Time per request: 280.000 [ms] (mean)
50% response time: 280 [ms] (mean)
75% response time: 290 [ms] (mean)
90% response time: 305 [ms] (mean)
95% response time: 307 [ms] (mean)
99% response time: 312 [ms] (mean)

Looks fine...

Moderate test
Concurrency Level: 63
Complete requests: 4992
Failed requests: 0
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 223.35 [#/sec] (mean)
Time per request: 280.000 [ms] (mean)
50% response time: 280 [ms] (mean)
75% response time: 300 [ms] (mean)
90% response time: 310 [ms] (mean)
95% response time: 310 [ms] (mean)
99% response time: 320 [ms] (mean)

Still good.

Heavy test
Concurrency Level: 302
Complete requests: 97344
Failed requests: 2496
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 533.86 [#/sec] (mean)
Time per request: 567.500 [ms] (mean)
50% response time: 500 [ms] (mean)
75% response time: 527 [ms] (mean)
90% response time: 552 [ms] (mean)
95% response time: 642 [ms] (mean)
99% response time: 5307 [ms] (mean)

Notice the average response time climbs to 500ms from the ~300ms baseline, and the 99th percentile is well over 5 seconds (effectively not returned in a timely manner). There were 2000+ failed requests. I will follow up after more investigation.
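The bug doesn't say which tool produced these numbers, but the fields (concurrency level, requests per second, percentile response times) map onto standard HTTP benchmarking output. As a rough sketch only, the "moderate" run could be approximated with ApacheBench; the URL is a placeholder for the staging edge endpoint and payload.json stands in for the small test body:

  # Hypothetical equivalent of the moderate run: ~5000 requests at concurrency 63.
  # payload.json is a stand-in for the small JSON test payload used in this bug.
  ab -n 4992 -c 63 -p payload.json -T 'application/json' \
     "https://<staging-edge>/submit/test/1/foo/bar/baz"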
What size request payloads were used? Did they approximate the size of the telemetry messages?
:trink, They were much smaller than typical telemetry messages. Just a small amount of test data (this first run was just to see how far I could push the server). You can see it on s3 at: net-mozaws-stage-us-west-2-pipeline-data/telemetry/20150313/telemetry/UNKNOWN/saved_session/Firefox/nightly/38.0a1
Talked with :sphilp about this. I suspect that we're hitting issues when the number of concurrent HTTP requests consistently exceeds the number of available packs, since that means some requests can't be completed until packs make it all the way through the pipeline to be recycled. I recommended multiplying both the poolsize and plugin_chansize settings by 4 to see if that impacts the latency and failed-request issues at concurrency 300.
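For context, poolsize and plugin_chansize are global settings in Heka's [hekad] config section; the defaults at the time were 100 and 30, which lines up with the 400/120 values used in the next comment. A rough sketch of the change (the config path is illustrative; the hekad-edge service name is taken from the logs below):

  # Sketch only: bump the pack pool and plugin channel sizes to 4x the defaults
  # in the global [hekad] section of the edge config, then restart the service.
  sudoedit /etc/hekad/edge.toml
  #   [hekad]
  #   poolsize = 400          # default was 100
  #   plugin_chansize = 120   # default was 30
  sudo systemctl restart hekad-edge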
Talked with :rmiller; upped the poolsize to 400 and plugin_chansize to 120 for hekad on the staging edge config, and the errors went away at 500 concurrency.

Ramped up the load test further to 1500 concurrent:

Concurrency Level: 1348
Complete requests: 298976
Failed requests: 112
Non-2xx responses: 0
Total Transferred: 0 bytes
Requests per second: 1014.25 [#/sec] (mean)
Time per request: 1330.000 [ms] (mean)
50% response time: 601 [ms] (mean)
75% response time: 2555 [ms] (mean)
90% response time: 3240 [ms] (mean)
95% response time: 3340 [ms] (mean)
99% response time: 3605 [ms] (mean)

The server started to choke at this point; /var/log/messages shows:

Mar 18 20:21:58 ip-172-31-47-65 collectd[9020]: write_http plugin: curl_easy_perform failed with status 28: SSL connection timeout
Mar 18 20:22:44 ip-172-31-47-65 collectd[9020]: write_http plugin: curl_easy_perform failed with status 28: SSL connection timeout
Mar 18 20:23:34 ip-172-31-47-65 systemd[1]: hekad-edge.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 18 20:23:34 ip-172-31-47-65 systemd[1]: Unit hekad-edge.service entered failed state.
Mar 18 20:23:34 ip-172-31-47-65 systemd[1]: hekad-edge.service holdoff time over, scheduling restart.
Mar 18 20:23:34 ip-172-31-47-65 systemd[1]: Stopping hekad-edge...
Mar 18 20:23:34 ip-172-31-47-65 systemd[1]: Starting hekad-edge...

It recovered itself, so that's nice :)

Ran the load test again; it still fails, but this time complaining about memory (only 152 MB free at this point):

Mar 18 20:33:26 ip-172-31-47-65 dhclient[2255]: DHCPREQUEST on eth0 to 172.31.32.1 port 67 (xid=0x4ee39782)
Mar 18 20:33:29 ip-172-31-47-65 dhclient[2255]: DHCPACK from 172.31.32.1 (xid=0x4ee39782)
Mar 18 20:33:32 ip-172-31-47-65 dhclient[2255]: fork: Cannot allocate memory
Mar 18 20:33:33 ip-172-31-47-65 dhclient[2255]: bound to 172.31.47.65 -- renewal in 1393 seconds.
Mar 18 20:33:48 ip-172-31-47-65 collectd[9020]: write_http plugin: curl_easy_perform failed with status 28: Operation timed out after 18788 milliseconds with 0 out of 0 bytes received
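For anyone retracing this: the automatic recovery is systemd restarting the unit after the crash. The restart and the low-memory symptoms can be checked with standard commands (unit name taken from the log lines above; the timestamps are just the window from this run):

  # Confirm hekad-edge was restarted by systemd and see why it exited.
  systemctl status hekad-edge
  journalctl -u hekad-edge --since "2015-03-18 20:20" --until "2015-03-18 20:40"
  # Watch free memory while the load test is running.
  free -m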
I terminated the old edge servers since they were locking up, and played with the poolsize to find the appropriate concurrency per edge server. It looks like we can handle about 400 concurrent requests per c3.large. We're running 3 servers on staging, so that gets us a bit over 1000 req/second.

1000 rps is about 86 million requests per day, which sounds like more than we are expecting, as telemetry is going to be opt-in. Everything seems good at this load; if we need to crank it up, though, there will need to be more testing, since things start to flake out beyond this level.

The main takeaway: the number of concurrent requests divided by the number of Heka instances (edge servers) determines the poolsize each server needs. If the number of requests exceeds the poolsize, Heka will choke. This could be handled more gracefully in code, but for now we need to make sure the poolsize is set high enough, spin up more instances, and/or use larger instances.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
To clarify that a bit: if the # of CONCURRENT requests PER SERVER exceeds the poolsize set on that server, Heka will choke.
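A quick sanity check on the arithmetic above (the 1200-concurrent figure is just an example to illustrate the per-server rule):

  # ~1000 requests/second sustained for a full day:
  echo $(( 1000 * 60 * 60 * 24 ))   # 86400000 -> roughly 86 million requests/day
  # Rule of thumb from this bug: concurrent requests divided by the number of
  # edge servers must stay at or below each server's poolsize, e.g.:
  echo $(( 1200 / 3 ))              # 400 per server -> poolsize must be >= ~400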
Product: Cloud Services → Cloud Services Graveyard