Closed Bug 1305768 Opened 8 years ago Closed 6 years ago

Periodic spikes in response time for requests from Treeherder for task logs through cloud-mirror

Categories

(Taskcluster :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: emorley, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Continuation of bug 1286935 - filing a new bug as requested :-)
(Not sure which component is appropriate)

We're seeing spikes in response time:
https://rpm.newrelic.com/accounts/677903/applications/14179733/externals?tw%5Bend%5D=1474995195&tw%5Bstart%5D=1474951995#id=5b2245787465726e616c2f71756575652e7461736b636c75737465722e6e65742f616c6c222c22225d
(screenshot attached, for those without access)

These requests are being made from Heroku (AWS US-east) to queue.taskcluster.net for log artifacts.
Flags: needinfo?(jhford)
Component: General → Queue
Blocks: thunder-try
Jonas, this is filed under the queue; I'm not sure if it's related to some of the timeouts you have been investigating.
Flags: needinfo?(jopsen)
Hmm, could be...

We still don't really have any idea what caused the excessive timeouts of the last few days.
Maybe some connectivity issues. I'm talking to Azure to try and get some insights.
But I really fear this is going to go unexplained :(
Flags: needinfo?(jopsen)
This isn't a problem with AWS, but with cloud-mirror. The spikes in request times always max out at 30s, which is cloud-mirror's cut-off for giving up and redirecting to the original source.
Component: Queue → Tools
Summary: Periodic spikes in response time for requests to queue.taskcluster.net from AWS → Periodic spikes in response time for requests from Treeherder for task logs through cloud-mirror
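As a rough way to check that theory, responses that hit the cut-off should cluster just under 30s. A minimal probe sketch (assuming Python and the requests library; the artifact URL in the usage comment is a placeholder, not a real task):

import time
import requests

def probe(artifact_url, cutoff=30.0):
    """Time one artifact fetch and flag it if it waited out the cut-off.

    Responses taking roughly `cutoff` seconds most likely waited for
    cloud-mirror before being redirected to the original S3 source.
    """
    start = time.monotonic()
    # stream=True returns once headers arrive, so this measures time to
    # first byte across the whole redirect chain (queue -> copy or origin).
    resp = requests.get(artifact_url, allow_redirects=True, stream=True, timeout=60)
    elapsed = time.monotonic() - start
    resp.close()
    return elapsed, elapsed >= cutoff - 1

# e.g. probe("https://queue.taskcluster.net/v1/task/<taskId>/runs/0/artifacts/public/logs/live_backing.log")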
We need to find a better way for the two systems to work with each other. John, can you please work on possible short-term and long-term solutions to this issue and document them here?
We could do some hack like sending a header "x-taskcluster-by-pass-cache: true".
  - this would be a one-off hack that easily breaks in the future.

Or, as garndt suggested in email, we could teach tc-treeherder to point to S3 directly.
  - this would break the encapsulation of tc-queue and easily breaks in the future.

Options that don't incur massive technical debt and obscurity include:
 A) cloud-mirror gets a separate queue (high-priority lane) for small artifacts
 B) cloud-mirror doesn't attempt to cache small artifacts and just redirects them directly
 C) treeherder log parsing accepts 30s delays at times
    (I'm not sure this is so unreasonable; with some async I/O it should still have decent throughput - see the sketch after this list)
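A rough sketch of what option C could look like on the consumer side, assuming aiohttp (the names and the concurrency limit are illustrative, not Treeherder's actual parser code):

import asyncio
import aiohttp

async def fetch_log(session, url, sem):
    # One ~30s response only ties up a single slot under the semaphore,
    # so overall throughput degrades gracefully instead of backing up.
    async with sem:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=90)) as resp:
            return await resp.text()

async def fetch_logs(urls, concurrency=50):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_log(session, u, sem) for u in urls))

# asyncio.run(fetch_logs(log_urls))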

Any other ideas?
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #11)

>  B) cloud-mirror doesn't attempt to cache small artifacts and just redirects
> them directly

I think this would most likely be fine, as long as "small" is reasonably large (say, 10M?). There may be the odd log that's very large, but that should be rare.

>  C) treeherder log parsing accepts 30s delays at times
>     (I'm not sure this is so unreasonable, using some async I/O should still
> have a decent through-put)

We do accept the 30s delay; it just causes a large backlog of tasks, since we normally have a fairly low limit on how many we can process at one time. It's hard to distinguish a large backlog caused by cloud-mirror from a genuine emergency.
We definitely want to minimize how frequently this happens and adjust where necessary. Everyone is still brainstorming options to see if a solution crops up that isn't a band-aid or hack in either system.

The spike noted in comment 13 looks to be up to 10 seconds, lasting 13-15 minutes.

Another option here is to move Treeherder to Heroku Private Spaces (us-west-2) after November 1st, which will put the application closer to the bucket we are serving artifacts from.

Unfortunately, there is no guarantee that we will keep hosting artifacts in S3; we could move them to another cloud vendor outside of AWS, which could introduce latency for other reasons.
Looking at the spike from yesterday, it appears that it lasted about 60 minutes, a far greater duration than the spikes we've seen before.

John and Jonas have recently worked on some patches for cloud-mirror that reduced the number of errors we were seeing when transferring between regions, but as we saw yesterday, they didn't completely fix the issue.

There is a PR that jonas is going to revisit about adding header support for bypassing the cache. That work is being done here: https://github.com/taskcluster/taskcluster-queue/pull/129
This PR has landed and should be available shortly.  Please see this comment on how to make use of it:
https://github.com/taskcluster/taskcluster-queue/pull/129/files#diff-4154dccc54a0b78c85533301068ca355R457
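For reference, roughly how a consumer such as Treeherder might set the header once this is deployed - a sketch assuming Python requests; see the linked comment for the authoritative usage:

import requests

def fetch_artifact_skipping_cache(artifact_url):
    # The header asks the queue to redirect straight to the origin bucket
    # instead of waiting for cloud-mirror to populate a regional copy.
    return requests.get(
        artifact_url,
        headers={"x-taskcluster-skip-cache": "true"},
        allow_redirects=True,
        timeout=60,
    )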
Many thanks - we'll check that out shortly.

Out of curiosity, has S3 cross-region replication (http://docs.aws.amazon.com/AmazonS3/latest/dev/crr.html) been investigated as an alternative to cloud mirror?

With the ability to filter by key prefix and have different lifecycle policies on both source and destination, it seems like a pretty good fit given that we know Treeherder will access every log uploaded - and would mean the objects start replicating as soon as they are uploaded, and not only on first request, thereby reducing the impact of any inter-region transfer slowdowns.
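For illustration, a prefix-filtered replication rule might look something like this (a boto3 sketch with placeholder bucket names and IAM role - only a sketch of the alternative being asked about, not anything that exists today):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="source-artifact-bucket",                        # placeholder
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/crr-role",  # placeholder
        "Rules": [
            {
                "ID": "replicate-task-logs",
                "Prefix": "public/logs/",  # only replicate log-like keys
                "Status": "Enabled",
                "Destination": {
                    "Bucket": "arn:aws:s3:::artifact-mirror-us-east-1",  # placeholder
                },
            }
        ],
    },
)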
We investigated it when we first introduced multiple-region support. I do not recall all the technical details, so Jonas or John might be able to fill in the gaps, but from what we could tell it was largely error-prone and slow.

Prefixing would work for very specific things, such as the static log files a worker produces, but not for most other things a task produces. Tasks can produce arbitrary artifacts and request them from any region we support.
(In reply to Ed Morley [:emorley] from comment #18)
> given that we know Treeherder will access every log uploaded - and would

We have a *lot* of tasks that are not reported to treeherder - I'd wager 30% or so.
See Also: → 1319441
Ok, just wanted to check in case :-)

I've filed bug 1319441 for Treeherder to start using x-taskcluster-skip-cache.

Many thanks!
Bug 1319441 (setting the x-taskcluster-skip-cache header) is deployed on Treeherder stage but not yet on prod, which gives an opportunity for comparison.

For example, last night prod experienced a 30 minute spike (starting ~20:04 UTC):
https://rpm.newrelic.com/accounts/677903/applications/14179757/externals?tw%5Bend%5D=1480370697&tw%5Bstart%5D=1480357297#id=5b2245787465726e616c2f71756575652e7461736b636c75737465722e6e65742f616c6c222c22225d

Whereas there was no spike on stage for the same timeframe:
https://rpm.newrelic.com/accounts/677903/applications/14179733/externals?tw%5Bend%5D=1480370697&tw%5Bstart%5D=1480357297#id=5b2245787465726e616c2f71756575652e7461736b636c75737465722e6e65742f616c6c222c22225d

If we switch to comparing total log parser transaction times (some of the log parser types use a streaming implementation, so the graphs above are partly time to first byte), we still see the same thing - everything works fine when we skip cloud-mirror:
...Prod: https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?show_browser=false&tw%5Bend%5D=1480370697&tw%5Bstart%5D=1480357297&type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f6c6f672d706172736572222c22225d
...Stage: https://rpm.newrelic.com/accounts/677903/applications/14179733/transactions?show_browser=false&tw%5Bend%5D=1480370697&tw%5Bstart%5D=1480357297&type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f6c6f672d706172736572222c22225d

Unless I'm misunderstanding, this disproves the "sometimes transfers between regions are just slow for all consumers" theory?

Anyway, so long as we can keep using the x-taskcluster-skip-cache header, we're happy with this solution. It's then just an S3 transfer cost versus man-hours decision for the Taskcluster team as to whether to try to find the root cause.
Thanks for sharing some of your graphs and findings.  We definitely want to get to the bottom of whatever issues there may be.
Flags: needinfo?(jhford)
See Also: → 1348072
Component: Tools → General
Found in triage.

Ed: is this still happening?
Flags: needinfo?(emorley)
Treeherder now partially bypasses cloud mirror (other than the redirect); I don't know if the underlying cause has been fixed.
Flags: needinfo?(emorley)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME