Closed Bug 1283505 Opened 6 years ago Closed 6 years ago

cloud-mirror returns HTTP 500s rather than preserving the original status code for 403s/404s

Categories

(Taskcluster :: Services, defect)

Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: jhford)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

I spotted this exception on the new stage Heroku Treeherder instance:
https://rpm.newrelic.com/accounts/677903/applications/14179733/traced_errors/e6f381-0ec81774-3ed1-11e6-bd66-b82a72d22a14

Specifically:
urllib2:HTTPError: HTTP Error 500: Internal Server Error 

...whilst fetching:
https://queue.taskcluster.net/v1/task/Q93B9jdzRzCsnLeeIV0FaQ/runs/0/artifacts/public%2Flogs%2Flive_backing.log

I can reproduce when fetching from a Heroku one-off dyno:

$ ths run -- bash
Running bash on treeherder-stage... up, run.5155
~ $ curl -IL https://queue.taskcluster.net/v1/task/Q93B9jdzRzCsnLeeIV0FaQ/runs/0/artifacts/public%2Flogs%2Flive_backing.log
HTTP/1.1 303 See Other
Server: Cowboy
Connection: keep-alive
X-Powered-By: Express
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
Access-Control-Request-Method: *
Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
Location: https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FQ93B9jdzRzCsnLeeIV0FaQ%2F0%2Fpublic%2Flogs%2Flive_backing.log
Vary: Accept
Content-Type: text/plain; charset=utf-8
Content-Length: 29
Date: Thu, 30 Jun 2016 14:50:27 GMT
Via: 1.1 vegur

HTTP/1.1 500 Internal Server Error
Server: Cowboy
Connection: keep-alive
X-Powered-By: Express
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
Access-Control-Request-Method: *
Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
Content-Type: application/json; charset=utf-8
Content-Length: 927
Etag: W/"39f-30f27a73"
Date: Thu, 30 Jun 2016 14:50:28 GMT
Via: 1.1 vegur

However if I try this locally, I instead get:

$ curl -IL https://queue.taskcluster.net/v1/task/Q93B9jdzRzCsnLeeIV0FaQ/runs/0/artifacts/public%2Flogs%2Flive_backing.log
HTTP/1.1 303 See Other
Server: Cowboy
Connection: keep-alive
X-Powered-By: Express
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
Access-Control-Request-Method: *
Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
Location: https://public-artifacts.taskcluster.net/Q93B9jdzRzCsnLeeIV0FaQ/0/public/logs/live_backing.log
Vary: Accept
Content-Type: text/plain; charset=utf-8
Content-Length: 29
Date: Thu, 30 Jun 2016 14:46:42 GMT
Via: 1.1 vegur

HTTP/1.1 403 Forbidden
Content-Type: application/xml
Connection: keep-alive
Date: Thu, 30 Jun 2016 14:46:42 GMT
Server: AmazonS3
X-Cache: Error from cloudfront
Via: 1.1 df6ab1791427412a072a852eb0428888.cloudfront.net (CloudFront)
X-Amz-Cf-Id: OKFQ1iRbLmOgbl532xRrEmHkbNp6MUj9vaJSCLt8JiGRSBLNruqo0A==

So:
1) I see that the local case accesses https://public-artifacts.taskcluster.net directly, whereas the Heroku case uses https://cloud-mirror.taskcluster.net. Presumably this is the cause of the 500 vs 403?
2) Can we fix the HTTP 500?
3) (less urgent) What's the cause of the HTTP 403? I presume more of bug 1154248?

Note: We currently fail silently for HTTP 403/404s (since they're ~expected due to bug 1154248), however we don't for the 500s - so if we can fix #2, it would at least reduce the noise on New Relic for us :-)

The job guid in question was:
43ddc1f6-3773-4730-ac9c-b79e215d0569/0

And repository:
fx-team
Flags: needinfo?(garndt)
The difference in the locations it fetches from ultimately comes down to where the request originates.  If you are within AWS (such as Heroku), the queue redirects you to our cloud-mirror instance, which caches artifacts within the region of your request to save on cross-region transfer costs.

If a request happens outside of a supported AWS region, then you will be redirected to a public endpoint not using cloud mirror.

I'm not sure why this artifact is giving a 500 when going through cloud mirror though.  John, do you have any ideas?
Flags: needinfo?(garndt) → needinfo?(jhford)
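The region-based redirect described above could be sketched roughly like this (illustrative JavaScript only; the function name, region list, and URL shapes are assumptions, not the actual queue implementation):

```javascript
// Regions where cloud-mirror has an in-region cache (assumed list).
const SUPPORTED_REGIONS = ['us-east-1', 'us-west-1', 'us-west-2'];

function artifactRedirect(requestRegion, artifactUrl) {
  if (SUPPORTED_REGIONS.includes(requestRegion)) {
    // In-AWS requests bounce through cloud-mirror so the artifact gets
    // cached in the requester's region, saving cross-region transfer costs.
    return 'https://cloud-mirror.taskcluster.net/v1/redirect/s3/' +
           requestRegion + '/' + encodeURIComponent(artifactUrl);
  }
  // Requests from outside supported regions go to the public endpoint.
  return artifactUrl;
}
```

This matches the two traces in comment 0: the Heroku dyno (in AWS) was sent to cloud-mirror.taskcluster.net, while the local machine was sent straight to public-artifacts.taskcluster.net.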
Ping on this? :-)

Also, in addition to fixing this HTTP 500, we should probably add better monitoring for cloud-mirror.taskcluster.net, since it's going to be on the critical path for Taskcluster/Treeherder once we move to Heroku (if it isn't already for other things in AWS). Ideally HTTP 500s in any of our critical webapps should result in New Relic/Papertrail/... alerts that are emailed out to people etc.
Hey, so I am doing some digging into this.  It appears that the input url is failing validation inside the cloud-mirror.  Validation expects the HTTP status code of the url, as seen from within the us-west-2 region, to be >= 200 && < 300, or a 3xx redirect other than 304 or 305 (304 and 305 are not redirects); this url is failing that check.

I'm going to add a little bit of extra debugging code to see if I can figure out what's going on.
Flags: needinfo?(jhford)
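As an assumed reconstruction (not the actual validate-url.js source), the check described above might look like:

```javascript
// Classify an upstream status code the way the redirect follower is
// described as doing: accept 2xx, follow 3xx redirects other than 304/305,
// and reject everything else (the rejection is what surfaced as an HTTP 500).
function classifyUpstreamStatus(code) {
  if (code >= 200 && code < 300) return 'ok';
  if (code >= 300 && code < 400 && code !== 304 && code !== 305) {
    return 'redirect';  // follow the Location header
  }
  return 'invalid';  // e.g. 403/404: validation throws, caller returns 500
}
```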
Simple PR to add lib-monitor in the redirect follower.
Attachment #8767716 - Flags: review?(bstack)
(In reply to John Ford [:jhford] from comment #3)
> Hey, so I am doing some digging into this.  It appears that the input url is
> failing validation inside the cloud-mirror.  The case that it's specifically
> failing is that the http status code of the url as seen from within the
> us-west-2 region is >=200 && < 300, or a 300 that's a 304 or 305 (304 and
> 305 are not redirects).

That would make sense, since when fetching locally (and so not hitting cloud-mirror), I get an HTTP 403. (See the second half of comment 0.)
So it seems that this issue can be described as:

If cloud mirror gets an HTTP status code in the 4XX range when fetching the log, it currently throws, when in fact it should propagate the same result back to the client, since it's not a server error.
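The behaviour comment 6 asks for could be sketched as follows (a hypothetical illustration of the proposed fix, not the actual patch): map the upstream status onto the client response instead of throwing.

```javascript
// Decide what to send the client based on what the upstream URL returned.
function clientResponseFor(upstreamStatus, cachedUrl) {
  if (upstreamStatus >= 200 && upstreamStatus < 300) {
    return {status: 302, location: cachedUrl};  // redirect to the mirror copy
  }
  if (upstreamStatus >= 400 && upstreamStatus < 500) {
    return {status: upstreamStatus};  // client error: preserve it (403 stays 403)
  }
  return {status: 500};  // anything else really is a server-side failure
}
```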
(In reply to Ed Morley [:emorley] from comment #6)
> So it seems that this issue can be described as:
> 
> If cloud mirror gets an HTTP status code in the 4XX range when fetching the
> log, it currently throws, when in fact it should propagate the same result
> back to the client, since it's not a server error.

I don't suppose you've had another chance to look at this?

It's just that it's blocking the Treeherder migration to Heroku, which we were hoping to do this week or so.
Flags: needinfo?(jhford)
Assignee: nobody → jhford
Comment on attachment 8767716 [details] [review]
add some debugging info

This got r+ on the GitHub PR, though it still needs merging.
Attachment #8767716 - Flags: review?(bstack) → review+
John, I don't suppose you could take a look at this bug soon, since it blocks our Heroku migration? Thanks :-)

Or alternatively, if John's away (there's been no Bugzilla activity since the 11th) - Selena, is there someone else who could take a look at this?
Flags: needinfo?(sdeckelmann)
I haven't had a chance to do a provisioner deployment but I will do it now.  This patch is a diagnostic so that we can see what's going wrong here.  When it's deployed, we should chat during the European daytime to get you to reproduce the error.
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(jhford)
(In reply to John Ford [:jhford] from comment #10)
> I haven't had a chance to do a provisioner deployment but I will do it now. 
> This patch is a diagnostic so that we can see what's going wrong here.  When
> it's deployed, we should chat during the European daytime to get you to
> reproduce the error.

I mean cloud-mirror :)  The deployment process for cloud mirror is very different and very complicated for the time being.
Whilst the diagnostic PR is great to have landed regardless, I don't think we need it to fix this issue.

As mentioned in comment 6, this is a design/logic error - which can be confirmed from code inspection.

To repro, use the log URL in comment 0 from an AWS instance.
Hi Ed,

The problem here is not cloud-mirror, it's that the upstream artifact is not being stored by the queue somehow.

In the taskcluster-public-artifacts bucket, which is where the upstream thing is supposed to come from, run 0 does not exist.

Q93B9jdzRzCsnLeeIV0FaQ/runs/0/artifacts/public/logs/live_backing.log is simply not there.  /0/ is missing completely; only /1/ exists.

The live log for /1/ does exist.  The stack trace in cloud-mirror shows that the error code is 403.  S3 does not return 404 here because there could be an inaccessible object at that URL, and distinguishing between 403 and 404 would tell an attacker that such an object exists.

In your local testing, you're redirected by the *queue* to our CloudFront URL: https://public-artifacts.taskcluster.net/Q93B9jdzRzCsnLeeIV0FaQ/0/public/logs/live_backing.log . This could have a cached copy of the S3 artifact that no longer exists in S3, I suppose.  If you were to hit the queue.tc.net/v1 endpoint from inside us-west-2, you'd get the 403 as expected.


Error in Cloud Mirror:

2016-07-14T15:30:45.227696+00:00 app[web.1]: Thu, 14 Jul 2016 15:30:45 GMT base:api Error occurred handling: /redirect/s3/us-east-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FQ93B9jdzRzCsnLeeIV0FaQ%2F0%2Fpublic%2Flogs%2Flive_backing.log, err: Error: error reading input url.  403, as JSON: {"code":"HTTPError"}, incidentId: 969cdf1c-5742-4b8c-bf25-0baad3c4a15b Error: error reading input url.  403
2016-07-14T15:30:45.227708+00:00 app[web.1]:     at /app/src/validate-url.js:97:17
2016-07-14T15:30:45.227708+00:00 app[web.1]:     at [object Generator].next (native)
2016-07-14T15:30:45.227709+00:00 app[web.1]:     at step (/app/lib/validate-url.js:120:191)
2016-07-14T15:30:45.227710+00:00 app[web.1]:     at /app/lib/validate-url.js:120:368


Errors from the client's perspective:

~ $ curl -IL https://queue.taskcluster.net/v1/task/Q93B9jdzRzCsnLeeIV0FaQ/runs/0/artifacts/public%2Flogs%2Flive_backing.log
HTTP/1.1 303 See Other
Server: Cowboy
Connection: keep-alive
X-Powered-By: Express
Strict-Transport-Security: max-age=7776000
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
Access-Control-Request-Method: *
Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
Location: https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FQ93B9jdzRzCsnLeeIV0FaQ%2F0%2Fpublic%2Flogs%2Flive_backing.log
Vary: Accept
Content-Type: text/plain; charset=utf-8
Content-Length: 29
Date: Thu, 14 Jul 2016 15:45:57 GMT
Via: 1.1 vegur

HTTP/1.1 500 Internal Server Error
Server: Cowboy
Connection: keep-alive
X-Powered-By: Express
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
Access-Control-Request-Method: *
Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
Content-Type: application/json; charset=utf-8
Content-Length: 927
Etag: W/"39f-21e3401d"
Date: Thu, 14 Jul 2016 15:45:57 GMT
Via: 1.1 vegur


~ $ curl -IL https://queue.taskcluster.net/v1/task/Q93B9jdzRzCsnLeeIV0FaQ/runs/1/artifacts/public%2Flogs%2Flive_backing.log
HTTP/1.1 303 See Other
Server: Cowboy
Connection: keep-alive
X-Powered-By: Express
Strict-Transport-Security: max-age=7776000
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
Access-Control-Request-Method: *
Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
Location: https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FQ93B9jdzRzCsnLeeIV0FaQ%2F1%2Fpublic%2Flogs%2Flive_backing.log
Vary: Accept
Content-Type: text/plain; charset=utf-8
Content-Length: 29
Date: Thu, 14 Jul 2016 15:45:08 GMT
Via: 1.1 vegur

HTTP/1.1 302 Found
Server: Cowboy
Connection: keep-alive
X-Powered-By: Express
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
Access-Control-Request-Method: *
Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
Location: https://cloud-mirror-production-us-east-1.s3.amazonaws.com/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FQ93B9jdzRzCsnLeeIV0FaQ%2F1%2Fpublic%2Flogs%2Flive_backing.log
Content-Type: application/json; charset=utf-8
Content-Length: 232
Etag: W/"e8-ea0d3cd3"
Date: Thu, 14 Jul 2016 15:45:11 GMT
Via: 1.1 vegur

HTTP/1.1 200 OK
x-amz-id-2: 2+WZV8qq85ChyECg8xywd4auUwYgh4/gbEMvaucO1vdISxQNEVXuw6ZdnitjwnCldTlnnI9g5Sw=
x-amz-request-id: 9BE1AD2B845BEA9E
Date: Thu, 14 Jul 2016 15:45:12 GMT
Last-Modified: Thu, 14 Jul 2016 15:45:11 GMT
x-amz-expiration: expiry-date="Sat, 16 Jul 2016 00:00:00 GMT", rule-id="us-east-1-1-day"
ETag: "a8736087623ead2aa539644821d86d65"
x-amz-meta-addresses: [{"code":200,"url":"https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Q93B9jdzRzCsnLeeIV0FaQ/1/public/logs/live_backing.log","t":"2016-07-14T15:45:09.428Z"}]
x-amz-meta-url: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Q93B9jdzRzCsnLeeIV0FaQ/1/public/logs/live_backing.log
x-amz-meta-upstream-etag: "a8736087623ead2aa539644821d86d65"
Content-Encoding: gzip
x-amz-meta-stored: 2016-07-14T15:45:09.428Z
Accept-Ranges: bytes
Content-Type: text/plain
Content-Length: 30233
Server: AmazonS3
Adjusting summary to reflect:

(In reply to Ed Morley [:emorley] from comment #6)
> So it seems that this issue can be described as:
> 
> If cloud mirror gets an HTTP status code in the 4XX range when fetching the
> log, it currently throws, when in fact it should propagate the same result
> back to the client, since it's not a server error.
Summary: HTTP 500s fetching live_backing.log when redirecting via https://cloud-mirror.taskcluster.net → cloud-mirror returns HTTP 500s rather than preserving the original status code for 403s/404s
From a us-east-1 machine:

Running bash on ⬢ cloud-mirror... up, run.4691
~ $ curl -IL https://queue.taskcluster.net/v1/task/Q93B9jdzRzCsnLeeIV0FaQ/runs/0/artifacts/public%2Flogs%2Flive_backing.log
HTTP/1.1 303 See Other
Server: Cowboy
Connection: keep-alive
X-Powered-By: Express
Strict-Transport-Security: max-age=7776000
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
Access-Control-Request-Method: *
Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
Location: https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FQ93B9jdzRzCsnLeeIV0FaQ%2F0%2Fpublic%2Flogs%2Flive_backing.log
Vary: Accept
Content-Type: text/plain; charset=utf-8
Content-Length: 29
Date: Thu, 14 Jul 2016 16:44:50 GMT
Via: 1.1 vegur

HTTP/1.1 403 Forbidden
Server: Cowboy
Connection: keep-alive
X-Powered-By: Express
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
Access-Control-Request-Method: *
Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
Content-Type: application/json; charset=utf-8
Content-Length: 87
Etag: W/"57-32b1e3b8"
Date: Thu, 14 Jul 2016 16:44:50 GMT
Via: 1.1 vegur
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Confirmed from a one-off dyno on our Heroku instance, thank you :-)
Component: Integration → Services