Closed Bug 1289830 Opened 8 years ago Closed 8 years ago

Webhead at 100% CPU due to high volume of requests from ActiveData-ETL

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Attachments

(3 files)

The ActiveData-ETL user agent has made ~1 million requests to Treeherder's API in 12 hours.

Just from one webhead (out of 3):

[emorley@treeherder1.webapp.scl3 ~]$ awk -F\" '{print $6}' /var/log/httpd/treeherder.*.org/access_log | sort | uniq -c | sort -nr | head -n 10
 301181 ActiveData-ETL
  68254 ouija
  18852 HTTP-Monitor/1.1
  14325 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:50.0) Gecko/20100101 Firefox/50.0
  10451 Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
   9924 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0
   7439 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:47.0) Gecko/20100101 Firefox/47.0
   6385 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
   4637 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
   3518 Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36

CPU is at 100% on two of our 3 webheads as a result.

Kyle, is this expected?
Either way, can you stop the requests for now please?

We really need to implement bug 1203556...
Flags: needinfo?(klahnakoski)
Web transactions overview for the last 12 hours:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw%5Bend%5D=1469638008&tw%5Bstart%5D=1469594808

The spike started at ~11am UTC+0, with another at ~1pm UTC+0, and has continued until now.

Example CPU usage on one of the prod webheads:
https://rpm.newrelic.com/accounts/677903/servers/4302251?tw%5Bend%5D=1469638126&tw%5Bstart%5D=1469594926

The volume of requests caused gunicorn worker starvation, which increased the queueing time for other requests as well.
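(For context: the webheads run a fixed pool of gunicorn workers, so a sustained flood from one client can occupy every worker and leave everyone else's requests waiting in the backlog. A minimal sketch of the kind of configuration involved is below — the worker count, bind address, and file name are placeholders, not Treeherder's actual settings.)

# gunicorn.conf.py -- illustrative placeholder values, not Treeherder's real config.
# With a small fixed pool of workers, ~5000 requests/minute from a single client
# can keep every worker busy, so all other requests queue and response times climb.
bind = "0.0.0.0:8000"
workers = 5            # hypothetical worker count per webhead
worker_class = "sync"  # synchronous workers: one request per worker at a time
timeout = 30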
Scaled back ActiveData-ETL.

The ES cache should be full now, and the request spikes should be rare.  I will stand up a service that funnels all TH requests and rate-limits them.
Flags: needinfo?(klahnakoski)
Thank you :-)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
ActiveData has made ~420,000 requests to our API in the last 1 hour 45 minutes, and is causing responsiveness issues for other users of the app, e.g.:

23:38 <KWierso> hmm, everything with treeherder seems slow right now...
23:38 <KWierso> reloading the page is taking forever
23:39 <KWierso> loading a log for a job is taking forever

Kyle, please can you take a look?
Status: RESOLVED → REOPENED
Flags: needinfo?(klahnakoski)
Resolution: FIXED → ---
ES, which serves as a cache for TH data, lost a major node, and it took quite a while for the cluster to recover.  During that time all the ETL processes defaulted to using TH.

I have a proposed fix [1], but it is only a heuristic. I would prefer to implement a real velocity limit.  I will continue to look at this tomorrow.

[1] https://github.com/klahnakoski/ActiveData-ETL/commit/560dc702cfaaef319e525d6abc41b1228e8c9795
Flags: needinfo?(klahnakoski)
This is happening again now:
* request queueing times through the roof
* New Relic alerts about CPU usage on the webheads
* Up to 5000-6000 requests per minute from ActiveData alone
* 1.7 million requests in the last ~8 hours from ActiveData

Unfortunately I really don't have enough spare cycles to stop and deal with this a couple of times a week. As such, I think we're just going to have to block ActiveData's user agent again, but this time not remove the block until there have been discussions about a better way of extracting this information (new APIs, read-only DB replicas, etc.).

Longer term Treeherder will use rate limits on the API so we don't need to single out individual UAs.
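(For reference, Treeherder's API is built on Django REST Framework, which ships built-in throttle classes; a global rate limit along the lines of the sketch below could replace blocking individual UAs. The rates shown are made-up placeholders, not a decided policy.)

# Django settings sketch -- throttle rates below are hypothetical examples.
REST_FRAMEWORK = {
    "DEFAULT_THROTTLE_CLASSES": [
        "rest_framework.throttling.AnonRateThrottle",
        "rest_framework.throttling.UserRateThrottle",
    ],
    "DEFAULT_THROTTLE_RATES": {
        "anon": "300/minute",   # placeholder limit for anonymous clients
        "user": "600/minute",   # placeholder limit for authenticated clients
    },
}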
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/7c61fb11d652c6010036220ed61668b5c11448a0
Bug 1289830 - Block ActiveData's user agent due to excessive requests

ActiveData's scraping of Treeherder's API has caused responsiveness and
performance issues for other users of Treeherder on several occasions,
so is being blocked until we can decide upon a less detrimental way for
ActiveData to obtain this data.
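(For anyone curious, a user-agent block of this kind amounts to rejecting matching requests before they reach the API views; a minimal Django middleware sketch of the idea follows. The class name, constant, and 403 response are illustrative assumptions, not necessarily what the commit above does.)

# Illustrative sketch only -- see the linked commit for Treeherder's actual change.
from django.http import HttpResponseForbidden

BLOCKED_USER_AGENTS = ("ActiveData-ETL",)  # hypothetical constant

class BlockUserAgentsMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        ua = request.META.get("HTTP_USER_AGENT", "")
        if any(blocked in ua for blocked in BLOCKED_USER_AGENTS):
            # Reject the request early so it never ties up an API view.
            return HttpResponseForbidden("This user agent has been blocked.")
        return self.get_response(request)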
The block has been deployed to production.

Kyle, I think the best way forwards is:
* Start a newsgroup discussion in tools.treeherder about:
 - the types of data you're trying to scrape
 - why you need each piece of data (we've tried to ask this at various points but didn't really receive responses that allowed us to suggest alternative approaches) and roughly how many people are using the features it facilitates
 - how recent each data type needs to be (ie can we give DB snapshots etc)
 - what other sources you've looked into for fetching various pieces of data (eg I know one of the things that is fetched from Treeherder is the taskcluster task ID, but perhaps this is best taken from the Taskcluster Pulse stream)
* Pending the outcome of the discussion, bugs can be filed against ActiveData (for changes in approach/caching/...) and Treeherder (eg adding a new API endpoint that reduces the calls you have to make)
* Once those bugs are resolved, we can file a new Treeherder bug to unblock the ActiveData user agent

Many thanks :-)
Assignee: nobody → emorley
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Summary from ActiveData side:

ActiveData uses its ES cluster to coordinate the nodes so that they do not ask TH for the same information twice.  This worked as a rate limiter because ES was simply not fast enough to handle the 500 requests per second required by the workers.  This slowed all ETL processing down, and slowed all Treeherder requests down to something TH could handle.  Unfortunately, that was not fast enough for the ETL, and the past few days have been spent optimizing the ES cluster to respond faster.  The optimization was successful, and allowed processes to work faster, but it revealed that preventing duplicate requests was not a good rate limiter.

This morning I finished code that counts the number of requests made by the workers and prevents further requests beyond a limit [1].  But this is just another attempt to reduce the TH request rate, and it requires stress trials before I am certain it works.  Furthermore, there is a chance TH cannot even serve everything it ingests, at over 200K jobs per day [2].
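(For illustration, the usual shape of such a limiter is a token bucket sitting in front of the request path. The sketch below is a simplified single-process version and is not the code in [1]; the class, rate, and burst values are made up, and the real fix has to coordinate the budget across many ETL workers.)

# Simplified single-process token-bucket limiter -- illustrative only.
import threading
import time

class TokenBucket:
    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second   # tokens added per second
        self.capacity = burst         # maximum tokens held at once
        self.tokens = burst
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        # Block until a token is available, then spend it on one TH request.
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.05)

# e.g. cap this process at 10 Treeherder requests/second (made-up number)
th_limiter = TokenBucket(rate_per_second=10, burst=20)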

The biggest spikes:

Jul 27 06:00  1300ms - Initial implementation, and the opening of this bug
Aug 17 15:30  3000ms - ETL code did not handle ES cluster node loss properly, so all requests were sent to Treeherder
Aug 18 12:50  1000ms - Bad fix pushed to workers; caught before it went even higher
Aug 19 04:30  1500ms - Attempted to scale workers up to 60%, saw the problem, but emorley blocked the UA while I was killing workers

[1] https://github.com/klahnakoski/ActiveData-ETL/commit/de446161ac5d641288f947bd819d23b658b65895
[2] http://activedata.allizom.org/tools/query.html#query_id=jPZ2DS7W
Long range view.  Chart averaging underestimates the height of the spikes.
Past day, attempts at scaling workers to test "easy fixes" under load.  None worked.
Just for clarity: The code for collecting TH information is still active in the ETL processes.  If the block is cleared, any new ETL machines brought online will start making TH requests.