Closed Bug 1289830 Opened 8 years ago Closed 8 years ago

Webhead at 100% CPU due to high volume of requests from ActiveData-ETL

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Attachments

(3 files)

The ActiveData-ETL user agent has made ~1 million requests to Treeherder's API in 12 hours.

Just from one webhead (out of 3):

[emorley@treeherder1.webapp.scl3 ~]$ awk -F\" '{print $6}' /var/log/httpd/treeherder.*.org/access_log | sort | uniq -c | sort -nr | head -n 10
 301181 ActiveData-ETL
  68254 ouija
  18852 HTTP-Monitor/1.1
  14325 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:50.0) Gecko/20100101 Firefox/50.0
  10451 Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0
   9924 Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0
   7439 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:47.0) Gecko/20100101 Firefox/47.0
   6385 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
   4637 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
   3518 Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36

CPU is at 100% on two of our 3 webheads as a result.

Kyle, is this expected?
Either way, can you stop the requests for now please?

We really need to implement bug 1203556...
Flags: needinfo?(klahnakoski)
Web transactions overview for the last 12 hours:
https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?tw%5Bend%5D=1469638008&tw%5Bstart%5D=1469594808

The spike started at ~11am UTC+0, with another at ~1pm UTC+0, and has continued until now.

Example CPU usage on one of the prod webheads:
https://rpm.newrelic.com/accounts/677903/servers/4302251?tw%5Bend%5D=1469638126&tw%5Bstart%5D=1469594926

The volume of requests caused gunicorn worker starvation, which increased the queueing time for other requests as well.
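(For context: the webheads run a fixed pool of gunicorn workers, so a sustained flood from one client can occupy every worker and leave everyone else's requests waiting in the backlog. A minimal sketch of the kind of configuration involved is below — the worker count, bind address, and file name are placeholders, not Treeherder's actual settings.)

# gunicorn.conf.py -- illustrative placeholder values, not Treeherder's real config.
# With a small fixed pool of workers, ~5000 requests/minute from a single client
# can keep every worker busy, so all other requests queue and response times climb.
bind = "0.0.0.0:8000"
workers = 5            # hypothetical worker count per webhead
worker_class = "sync"  # synchronous workers: one request per worker at a time
timeout = 30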
Scaled back ActiveData-ETL.

The ES cache should be full now, and the request spikes should be rare.  I will stand up a service that funnels all TH requests and rate-limits them.
Flags: needinfo?(klahnakoski)
Thank you :-)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
ActiveData has made ~420,000 requests to our API in the last 1 hour 45 minutes, and is causing responsiveness issues for other users of the app, e.g.:

23:38 <KWierso> hmm, everything with treeherder seems slow right now...
23:38 <KWierso> reloading the page is taking forever
23:39 <KWierso> loading a log for a job is taking forever

Kyle, please can you take a look?
Status: RESOLVED → REOPENED
Flags: needinfo?(klahnakoski)
Resolution: FIXED → ---
ES, which serves as a cache for TH data, lost a major node, and it took quite a while for the cluster to recover.  During that time all the ETL processes defaulted to using TH.

I have a proposed fix [1], but it is only a heuristic. I would prefer to implement a real velocity limit.  I will continue to look at this tomorrow.

[1] https://github.com/klahnakoski/ActiveData-ETL/commit/560dc702cfaaef319e525d6abc41b1228e8c9795
Flags: needinfo?(klahnakoski)
This is happening again now:
* request queueing times through the roof
* New Relic alerts about CPU usage on the webheads
* Up to 5000-6000 requests per minute from ActiveData alone
* 1.7 million requests in the last ~8 hours from ActiveData

Unfortunately I really don't have enough spare cycles to stop and deal with this a couple of times a week. As such, I think we're just going to have to block ActiveData's user agent again, but this time not remove the block until there have been discussions about a better way of extracting this information (new APIs, read-only DB replicas, etc.).

Longer term Treeherder will use rate limits on the API so we don't need to single out individual UAs.
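(For reference, Treeherder's API is built on Django REST Framework, which ships built-in throttle classes; a global rate limit along the lines of the sketch below could replace blocking individual UAs. The rates shown are made-up placeholders, not a decided policy.)

# Django settings sketch -- throttle rates below are hypothetical examples.
REST_FRAMEWORK = {
    "DEFAULT_THROTTLE_CLASSES": [
        "rest_framework.throttling.AnonRateThrottle",
        "rest_framework.throttling.UserRateThrottle",
    ],
    "DEFAULT_THROTTLE_RATES": {
        "anon": "300/minute",   # placeholder limit for anonymous clients
        "user": "600/minute",   # placeholder limit for authenticated clients
    },
}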
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/7c61fb11d652c6010036220ed61668b5c11448a0
Bug 1289830 - Block ActiveData's user agent due to excessive requests

ActiveData's scraping of Treeherder's API has caused responsiveness and
performance issues for other users of Treeherder on several occasions,
so is being blocked until we can decide upon a less detrimental way for
ActiveData to obtain this data.
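(For anyone curious, a user-agent block of this kind amounts to rejecting matching requests before they reach the API views; a minimal Django middleware sketch of the idea follows. The class name, constant, and 403 response are illustrative assumptions, not necessarily what the commit above does.)

# Illustrative sketch only -- see the linked commit for Treeherder's actual change.
from django.http import HttpResponseForbidden

BLOCKED_USER_AGENTS = ("ActiveData-ETL",)  # hypothetical constant

class BlockUserAgentsMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        ua = request.META.get("HTTP_USER_AGENT", "")
        if any(blocked in ua for blocked in BLOCKED_USER_AGENTS):
            # Reject the request early so it never ties up an API view.
            return HttpResponseForbidden("This user agent has been blocked.")
        return self.get_response(request)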
The block has been deployed to production.

Kyle, I think the best way forwards is:
* Start a newsgroup discussion in tools.treeherder about:
 - the types of data you're trying to scrape
 - why you need each piece of data (we've tried to ask this at various points but didn't really receive responses that allowed us to suggest alternative approaches) and roughly how many people are using the features it facilitates
 - how recent each data type needs to be (ie can we give DB snapshots etc)
 - what other sources you've looked into for fetching various pieces of data (eg I know one of the things that is fetched from Treeherder is the taskcluster task ID, but perhaps this is best taken from the Taskcluster Pulse stream)
* Pending the outcome of the discussion, bugs can be filed against ActiveData (for changes in approach/caching/...) and Treeherder (eg adding a new API endpoint that reduces the calls you have to make)
* Once those bugs are resolved, we can file a new Treeherder bug to unblock the ActiveData user agent

Many thanks :-)
Assignee: nobody → emorley
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Summary from ActiveData side:

ActiveData uses its ES cluster to coordinate the nodes so that they do not ask TH for the same information twice.  This worked as a rate limiter because ES was simply not fast enough to handle the 500 requests per second required by the workers.  This slowed all ETL processing down, and slowed all Treeherder requests down to something TH could handle.  Unfortunately, that was not fast enough for the ETL, and the past few days have been spent optimizing the ES cluster to respond faster.  The optimization was successful, and allowed processes to work faster, but it revealed that preventing duplicate requests was not a good rate limiter.

This morning I finished code that counts the number of requests made by the workers and prevents further requests beyond a limit [1].  But this is just another attempt to reduce the TH request rate, and it requires stress trials before I am certain it works.  Furthermore, there is a chance TH cannot even serve everything it ingests, at over 200K jobs per day [2].
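(For illustration, the usual shape of such a limiter is a token bucket sitting in front of the request path. The sketch below is a simplified single-process version and is not the code in [1]; the class, rate, and burst values are made up, and the real fix has to coordinate the budget across many ETL workers.)

# Simplified single-process token-bucket limiter -- illustrative only.
import threading
import time

class TokenBucket:
    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second   # tokens added per second
        self.capacity = burst         # maximum tokens held at once
        self.tokens = burst
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        # Block until a token is available, then spend it on one TH request.
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.05)

# e.g. cap this process at 10 Treeherder requests/second (made-up number)
th_limiter = TokenBucket(rate_per_second=10, burst=20)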

The biggest spikes:

Jul 27 06:00  1300ms - Initial implementation, and the opening of this bug
Aug 17 15:30  3000ms - ETL code did not handle ES cluster node loss properly, so all requests were sent to Treeherder
Aug 18 12:50  1000ms - Bad fix pushed to workers; caught before it went even higher
Aug 19 04:30  1500ms - Attempted to scale workers up to 60%, saw the problem, but emorley blocked the UA while I was killing workers

[1] https://github.com/klahnakoski/ActiveData-ETL/commit/de446161ac5d641288f947bd819d23b658b65895
[2] http://activedata.allizom.org/tools/query.html#query_id=jPZ2DS7W
Long range view.  Chart averaging underestimates the height of the spikes.
Past day, attempts at scaling workers to test "easy fixes" under load.  None worked.
Just for clarity: The code for collecting TH information is still active in the ETL processes.  If the block is cleared, any new ETL machines brought online will start making TH requests.