Closed Bug 1348072 Opened 7 years ago Closed 6 years ago

Adjust the log URLs submitted to Treeherder to avoid redirects and slow response times

Categories

(Taskcluster :: Services, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: jhford)

References

Details

Taskcluster provides Treeherder with log URLs of the form:
https://queue.taskcluster.net/v1/task/Ve2bRtC_TAWov4Pex_e-jg/runs/0/artifacts/public/logs/live_backing.log

If accessed from outside AWS (or with the 'x-taskcluster-skip-cache: True' header set, which is what Treeherder uses), the response from that URL is an HTTP 303 redirect to:
https://public-artifacts.taskcluster.net/Ve2bRtC_TAWov4Pex_e-jg/0/public/logs/live_backing.log

Unfortunately this redirect:
* increases the total request time, even in the best case when queue.taskcluster.net isn't under high load (see timings below).
* increases the variability of response time (eg bug 1347945 / bug 1348071).
* cannot be cached (unlike the final public-artifacts.taskcluster.net HTTP request), so causes unnecessary requests when used client-side.
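
For reference, a minimal way to see the redirect itself (a Python requests sketch; the curl timings below measure the full request, this just shows the 303 and its Location header):

import requests

QUEUE_URL = ("https://queue.taskcluster.net/v1/task/Ve2bRtC_TAWov4Pex_e-jg"
             "/runs/0/artifacts/public/logs/live_backing.log")

# Don't follow the redirect, so the 303 and Location header can be inspected.
resp = requests.get(
    QUEUE_URL,
    headers={"x-taskcluster-skip-cache": "True"},
    allow_redirects=False,
)
print(resp.status_code)              # 303
print(resp.headers.get("Location"))  # the public-artifacts.taskcluster.net URL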

These URLs are used:
A) server-side during log parsing 
B) client-side by the unified log viewer [1] (which Treeherder embeds [2])
C) client-side by the test-centric UI prototype (we're fetching the *_error_summary.log job artifacts client side for rapid iteration during the prototype phase)

Slow response times for (A) cause backlogs during log ingestion, and for (B)/(C) result in a less responsive end-user experience.

For example, even in the best case scenario (queue.taskcluster.net not experiencing a spike in response time):

$ for i in {1..5}; do curl -sSLo /dev/null --compressed -w "%{time_total}\n" "https://queue.taskcluster.net/v1/task/Ve2bRtC_TAWov4Pex_e-jg/runs/0/artifacts/public/logs/live_backing.log"; done
0.561
0.565
0.560
0.560
0.575

$ for i in {1..5}; do curl -sSLo /dev/null --compressed -w "%{time_total}\n" "https://public-artifacts.taskcluster.net/Ve2bRtC_TAWov4Pex_e-jg/0/public/logs/live_backing.log"; done
0.105
0.096
0.095
0.095
0.104

That's from the UK (which is representative of what I get loading the log viewer), but let's also try from Heroku (us-east-1) with the skip cache header set (since that's what Treeherder does for log parsing):

~ $ for i in {1..5}; do curl -sSLo /dev/null --compressed -H 'x-taskcluster-skip-cache: True' -w "%{time_total}\n" "https://queue.taskcluster.net/v1/task/Ve2bRtC_TAWov4Pex_e-jg/runs/0/artifacts/public/logs/live_backing.log"; done
0.474
0.172
0.266
0.205
0.179

~ $ for i in {1..5}; do curl -sSLo /dev/null --compressed -H 'x-taskcluster-skip-cache: True' -w "%{time_total}\n" "https://public-artifacts.taskcluster.net/Ve2bRtC_TAWov4Pex_e-jg/0/public/logs/live_backing.log"; done
0.047
0.053
0.048
0.060
0.070

I realise queue.taskcluster.net is meant to do smart redirection to cloud-mirror, but seeing that non-AWS consumers (eg users on their local machine using the unified log viewer) and Treeherder (which skips the cache) don't make use of this, they might as well use the public-artifacts.taskcluster.net URL directly.

I suppose Treeherder could do something hacky like rewrite the URLs (to public-artifacts.taskcluster.net) during ingestion, but I was hoping we could adjust what was submitted instead? (After writing this bug I've now come across bug 1305768 comment 11 which mentions similar.)
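
For illustration, the hacky rewrite I have in mind would be something like the following (rough Python sketch; the pattern is an assumption based on the URLs in this bug, not a guaranteed API, which is exactly why baking it in is a hack):

import re

# Matches the current queue artifact URL layout, as seen in this bug.
QUEUE_LOG_RE = re.compile(
    r"^https://queue\.taskcluster\.net/v1/task/"
    r"(?P<task_id>[^/]+)/runs/(?P<run_id>\d+)/artifacts/(?P<name>.+)$"
)

def rewrite_log_url(url):
    """Rewrite a queue.taskcluster.net artifact URL to the direct
    public-artifacts.taskcluster.net form, or return it unchanged."""
    match = QUEUE_LOG_RE.match(url)
    if not match:
        return url
    return ("https://public-artifacts.taskcluster.net/"
            "{task_id}/{run_id}/{name}".format(**match.groupdict()))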

Thoughts? :-)


[1] https://taskcluster.github.io/unified-logviewer/?url=https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FVe2bRtC_TAWov4Pex_e-jg%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive_backing.log&highlightStart=&highlightEnd=&lineNumber=&wrapLines=false&showLineNumbers=true&jumpToHighlight=false&followLog=false&lineHeight=13&asText=false
[2] https://treeherder.mozilla.org/logviewer.html#?job_id=84326524&repo=mozilla-inbound&lineNumber=1-23021
See Also: → 1348071
Flags: needinfo?(garndt)
Jonas, any comments you would like to add about the queue redirect delays? Also, any objections to taskcluster-treeherder being updated to just provide the public-artifacts.tc.net URL instead of queue.tc.net? I do know the risk is that if queue.tc.net ever needed to redirect to somewhere other than public-artifacts.tc.net, then this would break.
Flags: needinfo?(garndt) → needinfo?(jopsen)
I don't have a short term solution...

We've talked about speeding up tc-queue, but so far it hasn't been critically necessary.
Doing it probably means moving to postgres or some other data store. A mid-term fix could be caching in redis,
but even that is complicated to do and only gives some improvement.
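
To be concrete, the kind of caching I mean is roughly this (hypothetical sketch, in Python for illustration even though the queue itself is node; the key layout and TTL are invented):

import json
import redis

# Cache the resolved artifact record so repeated requests skip the slow
# entity-store lookup.
cache = redis.Redis()

def get_artifact(task_id, run_id, name, lookup):
    key = "artifact:%s:%s:%s" % (task_id, run_id, name)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    artifact = lookup(task_id, run_id, name)     # the slow path
    cache.setex(key, 300, json.dumps(artifact))  # keep for 5 minutes
    return artifact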

---
Downsides of the hack garndt suggests:
 A) it can break at any time; we make no promises (and do not wish to)
 B) it'll need an owner who is willing to suffer the consequences (probably won't be so bad).
 C) historical log URLs on treeherder could be broken

Note: it's not that we change the internal URL scheme all the time, but we are NOT willing to lock it down.
That would require a larger consensus, which I doubt is feasible while jhford is refactoring the artifact API :)

I see (C) as the largest issue, but maybe it isn't, since most things expire after 14 days anyway.
Do people look up old stuff on treeherder?
Flags: needinfo?(jopsen)
I can empathise with not wanting to lock down the API.

However for URLs like:
https://public-artifacts.taskcluster.net/Ve2bRtC_TAWov4Pex_e-jg/0/public/logs/live_backing.log

...even if the API changes in the future, presumably all existing uploaded S3 objects will remain in the same locations? So as long as the URLs for new objects are correct (eg anyone changing the API makes a quick PR to taskcluster-treeherder first that removes this workaround temporarily) then old links won't break?

Or in this case would the S3 assets be moved around? (I'm not familiar with CloudFront, so I'm not sure whether it allows the URL structure to be mapped independently of the S3 locations.)

It's just that at the moment I'm contemplating adding this hack to Treeherder instead, but that makes it less visible to whoever changes the API in the future, so taskcluster-treeherder would be preferable. (Another option is for me to make Treeherder store the original URL but morph it at point of use, which would allow for easy removal, though it adds complexity/makes the API messy.)
Flags: needinfo?(jopsen)
Since treeherder seems to have incredibly detailed requirements for log storage and access, maybe it makes sense for us to build a special bucket (or whatever we choose - blob storage, maybe Heroku offers some high-speed object storage add-on?) for logs for treeherder, and have tc-treeherder copy the logs there?
Including John in this conversation because I believe he has mentioned storing artifacts in us-east, which would help applications that also run in that region (such as those on heroku).

John, is this what you proposed before? If so, could you elaborate on the potential time savings that TH and other applications could expect by not having to transfer from us-west?

Side note: please dump your thoughts about using another s3 region for artifacts into an RFC so we can discuss it and come up with a plan of how we will accomplish it.
Flags: needinfo?(jhford)
Admittedly, that last comment would fit better in a larger overall bug about decreasing the time to process taskcluster logs, rather than here, which is about adjusting the log URL. We can move the discussion elsewhere if needed.
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Since treeherder seems to have incredibly detailed requirements for log storage and access

I wouldn't say we have detailed requirements, other than:
- not be slow
- avoid layers that have been known to cause unreliability in the past
- don't break browser caching

These requirements are of course less of a concern for non-browser consumers that never need/use caching, or for consumers for whom the download is only a fraction of the task's end-to-end time, but I don't think we're being particularly unusual?

In the case of bug 1347956 (gzipping), this would reduce time and bandwidth costs for non-Treeherder consumers too.
Just so I get it right, you'd like to remove redirects from the process altogether?  I don't think that this is something we should do.  We currently map the taskid, runid and name through this redirect.  Right now, we map:

    artifact(taskId, runId, name)
to
    https://queue.taskcluster.net/v1/task/:taskId/runs/:runId/artifacts/:name
and
    https://public-artifacts.taskcluster.net/:taskId/:runId/:name

but that's not something which will stay forever.  One possible complication in the future is that we'd like to consider storing artifacts in the ec2 region where they're created.  Another is that we'd like to be able to store artifacts in a service other than S3.  Yet another is that we'd like to be able to store really small artifacts directly in the queue.  In all of these cases the URL for the backing storage is not deterministic.  There's no way to infer the backing storage's URL from the taskid, runid and name alone.  Since the queue owns artifact placement, it is also the source of information about where artifacts are located.

I think we could build the smarts for this into treeherder.  If taskcluster provides the type of artifact storage (e.g. S3 vs GCE), the region of the service (e.g. us-east-1, us-east-2), the taskid, the runid and the name, we would be able to have treeherder figure out what the URL would be.  The unfortunate thing here is that treeherder has to duplicate queue logic, but without doing that, we'd be trying to build precognition into our products or forcing ourselves into a specific artifact storage system which we'd not be able to change in the future.
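
To illustrate the duplication I mean, it would be something like this (rough Python sketch; every URL pattern and storage type name here is invented for the example, not a promised queue API):

# Rough sketch of the logic treeherder would have to carry around.
def artifact_url(storage_type, region, task_id, run_id, name):
    if storage_type == "s3":
        # roughly the current public-artifacts layout
        return ("https://public-artifacts.taskcluster.net/%s/%s/%s"
                % (task_id, run_id, name))
    if storage_type == "gcs":
        # hypothetical: artifacts stored in a Google Cloud Storage bucket
        return ("https://storage.googleapis.com/taskcluster-artifacts-%s/%s/%s/%s"
                % (region, task_id, run_id, name))
    if storage_type == "queue":
        # hypothetical: really small artifacts served directly by the queue
        return ("https://queue.taskcluster.net/v1/task/%s/runs/%s/artifacts/%s"
                % (task_id, run_id, name))
    raise ValueError("unknown storage type: %r" % storage_type)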

For the case of the live viewer, maybe it could be taught to use the final redirected-to URL (e.g. public-artifacts.tc.net) when loading new chunks, if it doesn't already do that.

For browser caching, are the raw S3 urls cacheable as they are right now?  If not, do you know what is stopping them from being cacheable?

Gzipping is something that will be enabled by the new artifact api, since it will allow specifying a content-encoding field which will be preserved in S3.
Assignee: nobody → jhford
Flags: needinfo?(jhford)
> another option is for me to make Treeherder store the original URL but morph it at point of use,
> which would allow for easy removal, though adds complexity/makes the API messy
Hmm, actually that seems like a decent option.
Maybe make it fall back to the original URL, and send an error to sentry or write a warning log message if the original worked
but the morphed URL didn't.
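
Roughly something like this (Python sketch only; `morph_url` stands in for whatever rewrite treeherder ends up using, it isn't an existing helper):

import logging
import requests

logger = logging.getLogger(__name__)

def fetch_log(original_url, morph_url):
    """Fetch a log, preferring the morphed (direct) URL but falling back to
    the original, warning loudly when only the original works."""
    morphed = morph_url(original_url)
    if morphed != original_url:
        resp = requests.get(morphed)
        if resp.ok:
            return resp
        # Make some noise (sentry / logs) so whoever changed the URL scheme,
        # or this hack, finds out quickly.
        logger.warning("morphed log URL %s failed (%s); falling back to %s",
                       morphed, resp.status_code, original_url)
    resp = requests.get(original_url)
    resp.raise_for_status()
    return resp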


Note: Long-term, taskcluster-queue will become reasonably fast.
We just haven't started refactoring the storage backend yet, but I think consensus is that it'll happen.


@jhford, logs are already gzipped and have a content-encoding header.
Flags: needinfo?(jopsen)
No longer blocks: 1337488
https://github.com/taskcluster/taskcluster-queue/pull/233/files will add a feature to allow skipping both the CDN and Caches.

Ed, when that lands, if you have x-taskcluster-skip-cdn set to true or 1, you'll start skipping the CDN as well.  If that would solve your issues, let's mark this bug as resolved.
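
Concretely, something like this (Python sketch; the redirect target is up to the queue and not verified here, but with both headers set it should point straight at S3 rather than the CDN or cloud-mirror, per the logic described later in this bug):

import requests

resp = requests.get(
    "https://queue.taskcluster.net/v1/task/Ve2bRtC_TAWov4Pex_e-jg"
    "/runs/0/artifacts/public/logs/live_backing.log",
    headers={
        "x-taskcluster-skip-cache": "true",
        "x-taskcluster-skip-cdn": "true",
    },
    allow_redirects=False,
)
print(resp.status_code, resp.headers.get("Location"))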
Flags: needinfo?(emorley)
Hi :-)

I think that might actually make things slower? i.e. for users of the log viewer (which retrieves the log client side) who are not in the US, it sounds like that change might make them access the log directly from us-east rather than CloudFront? (Though how a CloudFront miss for the cold cache case compares to us-east direct I don't know; I would guess still faster)

The main performance issue being described here was that:
* queue.tc.net had slow/spiky/sometimes extremely high response times
* the redirect breaks browser caching

The latter is less of an issue now that the test-based Treeherder view performs the test manifest aggregation server side.

If the former has improved (/there are other bugs tracking that work), then I'm happy to close this out, since it sounds like the URL schema is not guaranteed, so rewriting the URLs to avoid the redirect would be risky.
Flags: needinfo?(emorley)
(In reply to Ed Morley [:emorley] from comment #11)
> I think that might actually make things slower? ie: For users of the log viewer

Ah sorry ignore this part - the x-taskcluster-skip-cdn header won't be set there.
This will be addressed partially by the new Object service (https://github.com/taskcluster/taskcluster-rfcs/pull/120), but also by moving the db to postgres.
(In reply to Ed Morley [:emorley] from comment #11)
> Hi :-)
> 
> I think that might actually make things slower? ie: For users of the log
> viewer (which retrieves the log client side) who are not in the US, it
> sounds like that change might make them access the log directly from us-east
> rather than CloudFront? (Though how a CloudFront miss for the cold cache
> case compares to es-east direct I don't know; I would guess still faster)

If you only use skip-cache, it will only skip cloud-mirror.  The old skip-cdn will still skip the cdn.  These options can be used independently.

Basically:

// Pseudocode: region is the requester's region; skipCache / skipCdn come from
// the x-taskcluster-skip-cache / x-taskcluster-skip-cdn headers.
if (region === 'us-west-2') {
  return s3UsWest2Url;
} else if (skipCache && skipCdn) {
  return s3UsWest2Url;
} else if (skipCache) {
  return cdnUrl;
} else if (['us-west-1', 'us-east-1', /* ... */].includes(region)) {
  return cloudMirrorUrl;
} else {
  return cdnUrl;
}

> The main performance issue being described here was that:
> * queue.tc.net had slow/spiky/sometimes extremely high response times

Is this referring to the time from initiating a queue.tc.net request to finishing reading the ultimate 200 response body? Or is it the time from initiating the queue.tc.net request until the queue returns a 300 response?

> * the redirect breaks browser caching

Just having a redirect -- or is there something about how we're doing the redirect?  We'll likely never be able to serve artifacts directly from the queue, so there's nothing we can really do if that's the concern.

> 
> The latter is less of an issue now that the test-based Treeherder view
> performs the test manifest aggregation server side.
> 
> If the former has improved (/there are other bugs tracking that work), then
> I'm happy to close this out, since it sounds like the URL schema is not
> guaranteed, so rewriting the URLs to avoid the redirect would be risky.

General Queue performance (i.e. time from request to redirect) is tracked by our rewrite to Postgres, bug 1436478.  Optionally skipping cloud-mirror is addressed in bug 1419577.  Overall performance (i.e. time from request to queue to finished download) of S3 and/or CloudFront is mostly out of our control and so isn't actionable in a specific way.

Based on that information, and your comment that it's OK to close out if the work is being tracked, I'm going to close this bug.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Platform and Services → Services