Closed Bug 1348072 Opened 7 years ago Closed 6 years ago

Adjust the log URLs submitted to Treeherder to avoid redirects and slow response times

Categories

(Taskcluster :: Services, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: jhford)

References

Details

Taskcluster provides Treeherder with log URLs of the form:
https://queue.taskcluster.net/v1/task/Ve2bRtC_TAWov4Pex_e-jg/runs/0/artifacts/public/logs/live_backing.log

If accessed from outside AWS (or with the 'x-taskcluster-skip-cache: True' header set, which is what Treeherder uses), the response from that URL is an HTTP 303 redirect to:
https://public-artifacts.taskcluster.net/Ve2bRtC_TAWov4Pex_e-jg/0/public/logs/live_backing.log

Unfortunately this redirect:
* increases the total request time, even in the best case when queue.taskcluster.net isn't under high load (see timings below).
* increases the variability of response time (eg bug 1347945 / bug 1348071).
* cannot be cached (unlike the final public-artifacts.taskcluster.net HTTP request), so causes unnecessary requests when used client-side.
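
For reference, a minimal way to see the redirect itself (a Python requests sketch; the curl timings below measure the full request, this just shows the 303 and its Location header):

import requests

QUEUE_URL = ("https://queue.taskcluster.net/v1/task/Ve2bRtC_TAWov4Pex_e-jg"
             "/runs/0/artifacts/public/logs/live_backing.log")

# Don't follow the redirect, so the 303 and Location header can be inspected.
resp = requests.get(
    QUEUE_URL,
    headers={"x-taskcluster-skip-cache": "True"},
    allow_redirects=False,
)
print(resp.status_code)              # 303
print(resp.headers.get("Location"))  # the public-artifacts.taskcluster.net URL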

These URLs are used:
A) server-side during log parsing 
B) client-side by the unified log viewer [1] (which Treeherder embeds [2])
C) client-side by the test-centric UI prototype (we're fetching the *_error_summary.log job artifacts client side for rapid iteration during the prototype phase)

Slow response times for (A) cause backlogs during log ingestion, and for (B)/(C) result in a less responsive end-user experience.

For example, even in the best case scenario (queue.taskcluster.net not experiencing a spike in response time):

$ for i in {1..5}; do curl -sSLo /dev/null --compressed -w "%{time_total}\n" "https://queue.taskcluster.net/v1/task/Ve2bRtC_TAWov4Pex_e-jg/runs/0/artifacts/public/logs/live_backing.log"; done
0.561
0.565
0.560
0.560
0.575

$ for i in {1..5}; do curl -sSLo /dev/null --compressed -w "%{time_total}\n" "https://public-artifacts.taskcluster.net/Ve2bRtC_TAWov4Pex_e-jg/0/public/logs/live_backing.log"; done
0.105
0.096
0.095
0.095
0.104

That's from the UK (which is representative of what I get loading the log viewer), but let's also try from Heroku (us-east-1) with the skip cache header set (since that's what Treeherder does for log parsing):

~ $ for i in {1..5}; do curl -sSLo /dev/null --compressed -H 'x-taskcluster-skip-cache: True' -w "%{time_total}\n" "https://queue.taskcluster.net/v1/task/Ve2bRtC_TAWov4Pex_e-jg/runs/0/artifacts/public/logs/live_backing.log"; done
0.474
0.172
0.266
0.205
0.179

~ $ for i in {1..5}; do curl -sSLo /dev/null --compressed -H 'x-taskcluster-skip-cache: True' -w "%{time_total}\n" "https://public-artifacts.taskcluster.net/Ve2bRtC_TAWov4Pex_e-jg/0/public/logs/live_backing.log"; done
0.047
0.053
0.048
0.060
0.070

I realise queue.taskcluster.net is meant to do smart redirection to cloud-mirror, but seeing that non-AWS consumers (eg users on their local machine using the unified log viewer) and Treeherder (which skips the cache) don't make use of this, they might as well use the public-artifacts.taskcluster.net URL directly.

I suppose Treeherder could do something hacky like rewrite the URLs (to public-artifacts.taskcluster.net) during ingestion, but I was hoping we could adjust what was submitted instead? (After writing this bug I've now come across bug 1305768 comment 11 which mentions similar.)
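
For illustration, the hacky rewrite I have in mind would be something like the following (rough Python sketch; the pattern is an assumption based on the URLs in this bug, not a guaranteed API, which is exactly why baking it in is a hack):

import re

# Matches the current queue artifact URL layout, as seen in this bug.
QUEUE_LOG_RE = re.compile(
    r"^https://queue\.taskcluster\.net/v1/task/"
    r"(?P<task_id>[^/]+)/runs/(?P<run_id>\d+)/artifacts/(?P<name>.+)$"
)

def rewrite_log_url(url):
    """Rewrite a queue.taskcluster.net artifact URL to the direct
    public-artifacts.taskcluster.net form, or return it unchanged."""
    match = QUEUE_LOG_RE.match(url)
    if not match:
        return url
    return ("https://public-artifacts.taskcluster.net/"
            "{task_id}/{run_id}/{name}".format(**match.groupdict()))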

Thoughts? :-)


[1] https://taskcluster.github.io/unified-logviewer/?url=https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FVe2bRtC_TAWov4Pex_e-jg%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive_backing.log&highlightStart=&highlightEnd=&lineNumber=&wrapLines=false&showLineNumbers=true&jumpToHighlight=false&followLog=false&lineHeight=13&asText=false
[2] https://treeherder.mozilla.org/logviewer.html#?job_id=84326524&repo=mozilla-inbound&lineNumber=1-23021
See Also: → 1348071
Flags: needinfo?(garndt)
Jonas, any comments you would like to add about the queue redirect delays? Also, any objections to taskcluster-treeherder being updated to just provide the public-artifacts.tc.net URL instead of queue.tc.net? I do know the risk is that if queue.tc.net ever needed to redirect to somewhere other than public-artifacts.tc.net, then this would break.
Flags: needinfo?(garndt) → needinfo?(jopsen)
I don't have a short term solution...

We've talked about speeding up tc-queue, but so far it hasn't been critically necessary.
Doing it probably means moving to postgres or some other data store. A mid-term fix could be caching in redis,
but even that is complicated to do and only gives some improvement.
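
To be concrete, the kind of caching I mean is roughly this (hypothetical sketch, in Python for illustration even though the queue itself is node; the key layout and TTL are invented):

import json
import redis

# Cache the resolved artifact record so repeated requests skip the slow
# entity-store lookup.
cache = redis.Redis()

def get_artifact(task_id, run_id, name, lookup):
    key = "artifact:%s:%s:%s" % (task_id, run_id, name)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    artifact = lookup(task_id, run_id, name)     # the slow path
    cache.setex(key, 300, json.dumps(artifact))  # keep for 5 minutes
    return artifact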

---
Downsides of the hack garndt suggests:
 A) it can break at any time; we make no promises (and do not wish to)
 B) it'll need an owner who is willing to suffer the consequences (probably won't be so bad).
 C) historical log URLs on treeherder could be broken

Note: it's not that we change the internal URL scheme all the time, but we are NOT willing to lock it down.
That would require a larger consensus, which I doubt is feasible while jhford is refactoring the artifact API :)

I see (C) as the largest issue, but maybe it isn't, since most things expire after 14 days anyway.
Do people look up old stuff on treeherder?
Flags: needinfo?(jopsen)
I can empathise with not wanting to lock down the API.

However for URLs like:
https://public-artifacts.taskcluster.net/Ve2bRtC_TAWov4Pex_e-jg/0/public/logs/live_backing.log

...even if the API changes in the future, presumably all existing uploaded S3 objects will remain in the same locations? So as long as the URLs for new objects are correct (eg anyone changing the API makes a quick PR to taskcluster-treeherder first that removes this workaround temporarily) then old links won't break?

Or in this case would the S3 assets be moved around? (I'm not familiar with CloudFront, so I'm not sure whether it allows the URL structure to be mapped independently of the S3 locations.)

It's just that at the moment I'm contemplating adding this hack to Treeherder instead, but that makes it less visible to whoever changes the API in the future, so taskcluster-treeherder would be preferable. (Another option is for me to make Treeherder store the original URL but morph it at point of use, which would allow for easy removal, though it adds complexity/makes the API messy.)
Flags: needinfo?(jopsen)
Since treeherder seems to have incredibly detailed requirements for log storage and access, maybe it makes sense for us to build a special bucket (or whatever we choose - blob storage, maybe Heroku offers some high-speed object storage add-on?) for logs for treeherder, and have tc-treeherder copy the logs there?
Including John in this conversation because I believe he has mentioned storing artifacts in us-east, which would help applications that also run in that region (such as those on heroku).

John, is this what you proposed before? If so, could you elaborate on the potential time savings that TH and other applications could expect by not having to transfer from us-west?

Side note: please dump your thoughts about using another s3 region for artifacts into an RFC so we can discuss it and come up with a plan of how we will accomplish it.
Flags: needinfo?(jhford)
Admittedly, that last comment would fit better in a larger overall bug about decreasing the time to process taskcluster logs, rather than here, which is about adjusting the log URL. We can move the discussion elsewhere if needed.
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Since treeherder seems to have incredibly detailed requirements for log storage and access

I wouldn't say we have detailed requirements, other than:
- not be slow
- avoid layers that have been known to cause unreliability in the past
- don't break browser caching

These requirements are of course less of a concern for non-browser consumers that never need/use caching, or for consumers for whom the download is only a fraction of the task's end-to-end time, but I don't think we're being particularly unusual?

In the case of bug 1347956 (gzipping), this would reduce time and bandwidth costs for non-Treeherder consumers too.
Just so I get it right, you'd like to remove redirects from the process altogether?  I don't think that this is something we should do.  We currently map the taskid, runid and name through this redirect.  Right now, we map:

    artifact(taskId, runId, name)
to
    https://queue.taskcluster.net/v1/task/:taskId/runs/:runId/artifacts/:name
and
    https://public-artifacts.taskcluster.net/:taskId/:runId/:name

but that's not something which will stay forever.  One possible complication in the future is that we'd like to consider storing artifacts in the ec2 region where they're created.  Another is that we'd like to be able to store artifacts in a service other than S3.  Yet another is that we'd like to be able to store really small artifacts directly in the queue.  In all of these cases the URL for the backing storage is not deterministic.  There's no way to infer the backing storage's URL from the taskid, runid and name alone.  Since the queue owns artifact placement, it is also the source of information about where artifacts are located.

I think we could build the smarts for this into treeherder.  If taskcluster provides the type of artifact storage (e.g. S3 vs GCE), the region of the service (e.g. us-east-1, us-east-2), the taskid, the runid and the name, we would be able to have treeherder figure out what the URL would be.  The unfortunate thing here is that treeherder has to duplicate queue logic, but without doing that, we'd be trying to build precognition into our products or forcing ourselves into a specific artifact storage system which we'd not be able to change in the future.
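
To illustrate the duplication I mean, it would be something like this (rough Python sketch; every URL pattern and storage type name here is invented for the example, not a promised queue API):

# Rough sketch of the logic treeherder would have to carry around.
def artifact_url(storage_type, region, task_id, run_id, name):
    if storage_type == "s3":
        # roughly the current public-artifacts layout
        return ("https://public-artifacts.taskcluster.net/%s/%s/%s"
                % (task_id, run_id, name))
    if storage_type == "gcs":
        # hypothetical: artifacts stored in a Google Cloud Storage bucket
        return ("https://storage.googleapis.com/taskcluster-artifacts-%s/%s/%s/%s"
                % (region, task_id, run_id, name))
    if storage_type == "queue":
        # hypothetical: really small artifacts served directly by the queue
        return ("https://queue.taskcluster.net/v1/task/%s/runs/%s/artifacts/%s"
                % (task_id, run_id, name))
    raise ValueError("unknown storage type: %r" % storage_type)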

For the case of the live viewer, maybe it could be taught to use the final redirected-to URL (e.g. public-artifacts.tc.net) when loading new chunks, if it doesn't already do that.

For browser caching, are the raw S3 urls cacheable as they are right now?  If not, do you know what is stopping them from being cacheable?

Gzipping is something that will be enabled by the new artifact api, since it will allow specifying a content-encoding field which will be preserved in S3.
Assignee: nobody → jhford
Flags: needinfo?(jhford)
> another option is for me to make Treeherder store the original URL but morph it at point of use,
> which would allow for easy removal, though adds complexity/makes the API messy
Hmm, actually that seems like a decent option.
Maybe make it fall back to the original URL, and send an error to sentry or write a warning log message if the original worked
but the morphed URL didn't.
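
Roughly something like this (Python sketch only; `morph_url` stands in for whatever rewrite treeherder ends up using, it isn't an existing helper):

import logging
import requests

logger = logging.getLogger(__name__)

def fetch_log(original_url, morph_url):
    """Fetch a log, preferring the morphed (direct) URL but falling back to
    the original, warning loudly when only the original works."""
    morphed = morph_url(original_url)
    if morphed != original_url:
        resp = requests.get(morphed)
        if resp.ok:
            return resp
        # Make some noise (sentry / logs) so whoever changed the URL scheme,
        # or this hack, finds out quickly.
        logger.warning("morphed log URL %s failed (%s); falling back to %s",
                       morphed, resp.status_code, original_url)
    resp = requests.get(original_url)
    resp.raise_for_status()
    return resp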


Note: Long-term, taskcluster-queue will become reasonably fast.
We just haven't started refactoring the storage backend yet, but I think consensus is that it'll happen.


@jhford, logs are already gzipped and have a content-encoding header.
Flags: needinfo?(jopsen)
No longer blocks: 1337488
https://github.com/taskcluster/taskcluster-queue/pull/233/files will add a feature to allow skipping both the CDN and Caches.

Ed, when that lands, if you have x-taskcluster-skip-cdn set to true or 1, you'll start skipping the CDN as well.  If that would solve your issues, let's mark this bug as resolved.
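
Concretely, something like this (Python sketch; the redirect target is up to the queue and not verified here, but with both headers set it should point straight at S3 rather than the CDN or cloud-mirror, per the logic described later in this bug):

import requests

resp = requests.get(
    "https://queue.taskcluster.net/v1/task/Ve2bRtC_TAWov4Pex_e-jg"
    "/runs/0/artifacts/public/logs/live_backing.log",
    headers={
        "x-taskcluster-skip-cache": "true",
        "x-taskcluster-skip-cdn": "true",
    },
    allow_redirects=False,
)
print(resp.status_code, resp.headers.get("Location"))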
Flags: needinfo?(emorley)
Hi :-)

I think that might actually make things slower? i.e. for users of the log viewer (which retrieves the log client side) who are not in the US, it sounds like that change might make them access the log directly from us-east rather than CloudFront? (Though how a CloudFront miss for the cold cache case compares to us-east direct I don't know; I would guess still faster)

The main performance issue being described here was that:
* queue.tc.net had slow/spiky/sometimes extremely high response times
* the redirect breaks browser caching

The latter is less of an issue now that the test-based Treeherder view performs the test manifest aggregation server side.

If the former has improved (/there are other bugs tracking that work), then I'm happy to close this out, since it sounds like the URL schema is not guaranteed, so rewriting the URLs to avoid the redirect would be risky.
Flags: needinfo?(emorley)
(In reply to Ed Morley [:emorley] from comment #11)
> I think that might actually make things slower? ie: For users of the log viewer

Ah sorry ignore this part - the x-taskcluster-skip-cdn header won't be set there.
This will be addressed partially by the new Object service (https://github.com/taskcluster/taskcluster-rfcs/pull/120), but also by moving the db to postgres.
(In reply to Ed Morley [:emorley] from comment #11)
> Hi :-)
> 
> I think that might actually make things slower? ie: For users of the log
> viewer (which retrieves the log client side) who are not in the US, it
> sounds like that change might make them access the log directly from us-east
> rather than CloudFront? (Though how a CloudFront miss for the cold cache
> case compares to es-east direct I don't know; I would guess still faster)

If you only use skip-cache, it will only skip cloud-mirror.  The old skip-cdn will still skip the cdn.  These options can be used independently.

Basically:

// Pseudocode: region is the requester's region; skipCache / skipCdn come from
// the x-taskcluster-skip-cache / x-taskcluster-skip-cdn headers.
if (region === 'us-west-2') {
  return s3UsWest2Url;
} else if (skipCache && skipCdn) {
  return s3UsWest2Url;
} else if (skipCache) {
  return cdnUrl;
} else if (['us-west-1', 'us-east-1', /* ... */].includes(region)) {
  return cloudMirrorUrl;
} else {
  return cdnUrl;
}

> The main performance issue being described here was that:
> * queue.tc.net had slow/spiky/sometimes extremely high response times

Is this referring to the time from initiating a queue.tc.net request to finishing reading the ultimate 200 response body? Or is it the time from initiating the queue.tc.net request until the queue returns a 300 response?

> * the redirect breaks browser caching

Just having a redirect -- or is there something about how we're doing the redirect?  We'll likely never be able to serve artifacts directly from the queue, so there's nothing we can really do if that's the concern.

> 
> The latter is less of an issue now that the test-based Treeherder view
> performs the test manifest aggregation server side.
> 
> If the former has improved (/there are other bugs tracking that work), then
> I'm happy to close this out, since it sounds like the URL schema is not
> guaranteed, so rewriting the URLs to avoid the redirect would be risky.

General Queue performance (i.e. time from request to redirect) is tracked by our rewrite to Postgres, bug 1436478.  Optionally skipping cloud-mirror is addressed in bug 1419577.  Overall performance (i.e. time from request to queue to finished download) of S3 and/or CloudFront is mostly out of our control and so isn't actionable in a specific way.

Based on that information, and your comment that it's OK to close out if the work is being tracked, I'm going to close this bug.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Platform and Services → Services