Closed Bug 1305768 Opened 8 years ago Closed 6 years ago

Periodic spikes in response time for requests from Treeherder for task logs through cloud-mirror

Categories

(Taskcluster :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: emorley, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Continuation of bug 1286935 - filing a new bug as requested :-)
(Not sure which component is appropriate)

We're seeing spikes in response time:
https://rpm.newrelic.com/accounts/677903/applications/14179733/externals?tw%5Bend%5D=1474995195&tw%5Bstart%5D=1474951995#id=5b2245787465726e616c2f71756575652e7461736b636c75737465722e6e65742f616c6c222c22225d
(screenshot attached, for those without access)

These requests are being made from Heroku (AWS US-east) to queue.taskcluster.net for log artifacts.
Flags: needinfo?(jhford)
Component: General → Queue
Blocks: thunder-try
Jonas, this is filed under the queue; I'm not sure if it's related to some of the timeouts you have been investigating.
Flags: needinfo?(jopsen)
Hmm, could be...

We still don't really have any idea what caused the excessive timeouts of the last few days.
Maybe some connectivity issues. I'm talking to Azure to try and get some insights.
But I really fear this is going to go unexplained :(
Flags: needinfo?(jopsen)
This isn't a problem with AWS, but with cloud-mirror. The spikes in request times always max out at 30s, which is cloud-mirror's cut-off for giving up and redirecting to the original source.
Component: Queue → Tools
Summary: Periodic spikes in response time for requests to queue.taskcluster.net from AWS → Periodic spikes in response time for requests from Treeherder for task logs through cloud-mirror
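As a rough way to check that theory, responses that hit the cut-off should cluster just under 30s. A minimal probe sketch (assuming Python and the requests library; the artifact URL in the usage comment is a placeholder, not a real task):

import time
import requests

def probe(artifact_url, cutoff=30.0):
    """Time one artifact fetch and flag it if it waited out the cut-off.

    Responses taking roughly `cutoff` seconds most likely waited for
    cloud-mirror before being redirected to the original S3 source.
    """
    start = time.monotonic()
    # stream=True returns once headers arrive, so this measures time to
    # first byte across the whole redirect chain (queue -> copy or origin).
    resp = requests.get(artifact_url, allow_redirects=True, stream=True, timeout=60)
    elapsed = time.monotonic() - start
    resp.close()
    return elapsed, elapsed >= cutoff - 1

# e.g. probe("https://queue.taskcluster.net/v1/task/<taskId>/runs/0/artifacts/public/logs/live_backing.log")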
We need to find a better way for the two systems to work with each other. John, can you please work on possible short-term and long-term solutions to this issue and document them here?
We could do some hack like sending a header "x-taskcluster-by-pass-cache: true".
  - this would be a one-off hack that easily breaks in the future.

Or, as garndt suggested in email, we could teach tc-treeherder to point to S3 directly.
  - this would break the encapsulation of tc-queue and easily breaks in the future.

Options that don't incur massive technical debt and obscurity include:
 A) cloud-mirror gets a separate queue (high-priority lane) for small artifacts
 B) cloud-mirror doesn't attempt to cache small artifacts and just redirects them directly
 C) treeherder log parsing accepts 30s delays at times
    (I'm not sure this is so unreasonable; with some async I/O it should still have decent throughput - see the sketch after this list)
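A rough sketch of what option C could look like on the consumer side, assuming aiohttp (the names and the concurrency limit are illustrative, not Treeherder's actual parser code):

import asyncio
import aiohttp

async def fetch_log(session, url, sem):
    # One ~30s response only ties up a single slot under the semaphore,
    # so overall throughput degrades gracefully instead of backing up.
    async with sem:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=90)) as resp:
            return await resp.text()

async def fetch_logs(urls, concurrency=50):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_log(session, u, sem) for u in urls))

# asyncio.run(fetch_logs(log_urls))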

Any other ideas?
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #11)

>  B) cloud-mirror doesn't attempt to cache small artifacts and just redirects
> them directly

I think this would most likely be fine, as long as "small" is reasonably large (say, 10M?). There may be the odd log that's very large, but that should be rare.

>  C) treeherder log parsing accepts 30s delays at times
>     (I'm not sure this is so unreasonable, using some async I/O should still
> have a decent through-put)

We do accept the 30s delay; it just causes a large backlog of tasks, since we normally have a fairly low limit on how many we can process at one time. It's hard to distinguish a large backlog caused by cloud-mirror from a genuine emergency.
We definitely want to minimize how frequently this happens and adjust where necessary. Everyone is still brainstorming options to see if a solution crops up that isn't a band-aid or hack in either system.

The spike noted in comment 13 looks to be up to 10 seconds, lasting 13-15 minutes.

Another option here is to move Treeherder to Heroku Private Spaces (us-west-2) after November 1st, which will put the application closer to the bucket we are serving artifacts from.

Unfortunately, there is no guarantee that we will keep hosting artifacts in S3; we could move them to another cloud vendor outside of AWS, which could introduce latency for other reasons.
Looking at the spike from yesterday, it appears that it lasted about 60 minutes, a far greater duration than the spikes we've seen before.

John and Jonas have recently worked on some patches for cloud-mirror that reduced the number of errors we were seeing when transferring between regions, but as we saw yesterday, they didn't completely fix the issue.

There is a PR that jonas is going to revisit about adding header support for bypassing the cache. That work is being done here: https://github.com/taskcluster/taskcluster-queue/pull/129
This PR has landed and should be available shortly.  Please see this comment on how to make use of it:
https://github.com/taskcluster/taskcluster-queue/pull/129/files#diff-4154dccc54a0b78c85533301068ca355R457
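For reference, roughly how a consumer such as Treeherder might set the header once this is deployed - a sketch assuming Python requests; see the linked comment for the authoritative usage:

import requests

def fetch_artifact_skipping_cache(artifact_url):
    # The header asks the queue to redirect straight to the origin bucket
    # instead of waiting for cloud-mirror to populate a regional copy.
    return requests.get(
        artifact_url,
        headers={"x-taskcluster-skip-cache": "true"},
        allow_redirects=True,
        timeout=60,
    )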
Many thanks - we'll check that out shortly.

Out of curiosity, has S3 cross-region replication (http://docs.aws.amazon.com/AmazonS3/latest/dev/crr.html) been investigated as an alternative to cloud mirror?

With the ability to filter by key prefix and have different lifecycle policies on both source and destination, it seems like a pretty good fit given that we know Treeherder will access every log uploaded - and would mean the objects start replicating as soon as they are uploaded, and not only on first request, thereby reducing the impact of any inter-region transfer slowdowns.
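For illustration, a prefix-filtered replication rule might look something like this (a boto3 sketch with placeholder bucket names and IAM role - only a sketch of the alternative being asked about, not anything that exists today):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_replication(
    Bucket="source-artifact-bucket",                        # placeholder
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/crr-role",  # placeholder
        "Rules": [
            {
                "ID": "replicate-task-logs",
                "Prefix": "public/logs/",  # only replicate log-like keys
                "Status": "Enabled",
                "Destination": {
                    "Bucket": "arn:aws:s3:::artifact-mirror-us-east-1",  # placeholder
                },
            }
        ],
    },
)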
We investigated it when we first introduced multiple-region support. I do not recall all the technical details, so Jonas or John might be able to fill in the gaps, but from what we could tell it was largely error-prone and slow.

Prefixing would work for very specific things, such as the static log files a worker produces, but not for most other things a task produces. Tasks can produce arbitrary artifacts and request them from any region we support.
(In reply to Ed Morley [:emorley] from comment #18)
> given that we know Treeherder will access every log uploaded - and would

We have a *lot* of tasks that are not reported to treeherder - I'd wager 30% or so.
See Also: → 1319441
Ok, just wanted to check in case :-)

I've filed bug 1319441 for Treeherder to start using x-taskcluster-skip-cache.

Many thanks!
Bug 1319441 (setting the x-taskcluster-skip-cache header) is deployed on Treeherder stage but not yet on prod, which gives an opportunity for comparison.

For example, last night prod experienced a 30 minute spike (starting ~20:04 UTC):
https://rpm.newrelic.com/accounts/677903/applications/14179757/externals?tw%5Bend%5D=1480370697&tw%5Bstart%5D=1480357297#id=5b2245787465726e616c2f71756575652e7461736b636c75737465722e6e65742f616c6c222c22225d

Whereas there was no spike on stage for the same timeframe:
https://rpm.newrelic.com/accounts/677903/applications/14179733/externals?tw%5Bend%5D=1480370697&tw%5Bstart%5D=1480357297#id=5b2245787465726e616c2f71756575652e7461736b636c75737465722e6e65742f616c6c222c22225d

If we switch to comparing total log parser transaction times (some of the log parser types use a streaming implementation, so the graphs above are partly time to first byte), we still see the same thing - everything works fine when we skip cloud-mirror:
...Prod: https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?show_browser=false&tw%5Bend%5D=1480370697&tw%5Bstart%5D=1480357297&type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f6c6f672d706172736572222c22225d
...Stage: https://rpm.newrelic.com/accounts/677903/applications/14179733/transactions?show_browser=false&tw%5Bend%5D=1480370697&tw%5Bstart%5D=1480357297&type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f6c6f672d706172736572222c22225d

Unless I'm misunderstanding, this disproves the "sometimes transfers between regions are just slow for all consumers" theory?

Anyway, so long as we can keep using the x-taskcluster-skip-cache header, we're happy with this solution. It's then just an S3 transfer cost versus man-hours decision for the Taskcluster team as to whether to try to find the root cause.
Thanks for sharing some of your graphs and findings.  We definitely want to get to the bottom of whatever issues there may be.
Flags: needinfo?(jhford)
See Also: → 1348072
Component: Tools → General
Found in triage.

Ed: is this still happening?
Flags: needinfo?(emorley)
Treeherder now partially bypasses cloud mirror (other than the redirect); I don't know if the underlying cause has been fixed.
Flags: needinfo?(emorley)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME