Closed Bug 1310643 Opened 8 years ago Closed 6 years ago

taskcluster-proxy is rewriting non-GET requests to be GET requests

Categories

(Taskcluster :: Services, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: pmoore)

References

Details

Attachments

(1 file)

See bug 1310561 for more detail, but basically what happened was that the requests made to the taskcluster-proxy were being submitted to the queue with the wrong HTTP Method.  This meant that a PUT to /task/:taskId was received by the queue as a GET to /task/:taskId.  Because the :taskId specified had not yet been created through a PUT, the GET endpoint couldn't find it, and so generated an error message that said we couldn't find the Task.

I reverted the work in https://github.com/taskcluster/taskcluster-proxy/pull/24 by changing the 'latest' to point back to 3.0.8 and also creating the 3.0.10 tag to represent that same work as well.

I haven't debugged the issue at all, but it looks like there's an 8 month regression window to dig through :/
+ Submitting SSL tests (upgradedb) @ linux32[arm-debug].
! FAIL: Scheduling Linux 64 (ASan, debug) w/ gcc-6.1 @ linux64[asan] failed. { Error: 00Qwr9cwSwm4Wa3e0TxmvQ does not correspond to a task that exists.
Are you sure this task has already been submitted?
----
errorCode:  ResourceNotFound
statusCode: 404
requestInfo:
  method:   task
  params:   {"taskId":"00Qwr9cwSwm4Wa3e0TxmvQ"}
  payload:  {}
  time:     2016-10-17T04:21:47.421Z
details:
{
  "taskId": "00Qwr9cwSwm4Wa3e0TxmvQ"
}
    at /home/worker/nss/automation/taskcluster/graph/node_modules/taskcluster-client/lib/client.js:324:21
    at /home/worker/nss/automation/taskcluster/graph/node_modules/promise/lib/core.js:33:15
    at flush (/home/worker/nss/automation/taskcluster/graph/node_modules/asap/asap.js:27:13)
    at _combinedTickCallback (internal/process/next_tick.js:67:7)
    at process._tickCallback (internal/process/next_tick.js:98:9)
  body: 
   { code: 'ResourceNotFound',
     message: '00Qwr9cwSwm4Wa3e0TxmvQ does not correspond to a task that exists.\nAre you sure this task has already been submitted?\n----\nerrorCode:  ResourceNotFound\nstatusCode: 404\nrequestInfo:\n  method:   task\n  params:   {"taskId":"00Qwr9cwSwm4Wa3e0TxmvQ"}\n  payload:  {}\n  time:     2016-10-17T04:21:47.421Z\ndetails:\n{\n  "taskId": "00Qwr9cwSwm4Wa3e0TxmvQ"\n}',
     requestInfo: 
      { method: 'task',
        params: [Object],
        payload: {},
        time: '2016-10-17T04:21:47.421Z' },
     details: { taskId: '00Qwr9cwSwm4Wa3e0TxmvQ' } },
  code: 'ResourceNotFound',
  statusCode: 404 }

---

Proxy logging:

Oct 16 23:21:47 docker-worker.aws-provisioner.us-west-1b.ami-43307e23.c3-xlarge.i-087b82a0d94a4673e docker-worker: 2016/10/17 04:21:47 Proxying /queue/v1/task/00Qwr9cwSwm4Wa3e0TxmvQ | GET | https://queue.taskcluster.net/v1/task/00Qwr9cwSwm4Wa3e0TxmvQ#015
logs are here:

https://public-artifacts.taskcluster.net/C_0mMq2uQUuTOqkTVz82lg/0/public/logs/live_backing.log

The code calling this is nothing weird:

https://hg.mozilla.org/projects/nss/file/803319e95365ea37b901f853c8f14f26c1a0ceca/automation/taskcluster/graph/src/queue.js#l201
          try {
            await queue.createTask(data.taskId, convertTask(image_task));
          } catch (e) {
            console.error("! FAIL: Scheduling image builder task failed.");
            continue; /* Skip this task on failure. */
          }
I didn't realize how much had been deployed this weekend.

https://github.com/taskcluster/taskcluster-proxy/commit/6e8d8b5984ce04f1b7fd6fbfa9b904e7e1aa76f1

or something nearby might be responsible.
Assignee: nobody → pmoore
Looking at the changes since then, the main piece is that there is a root handler that's used, but I do not see anything changing the request method, it looks like it's the same method made from the client being forwarded on.

https://github.com/taskcluster/taskcluster-proxy/blob/master/routes.go#L135
Also what is interesting about this is that gecko-decision tasks were not affected (or it wasn't reported), which should be using the same proxy.
Looking at sample gecko-decision machine [1] that spawned Saturday, it loaded the same taskcluter-proxy image [2] that the other workers [3] loaded that exhibited the problem, however the method was PUT for all the tasks it created:

Oct 15 19:16:59 docker-worker.aws-provisioner.us-west-2a.ami-cfe830af.c3-xlarge.i-053eff9921b5d2d9e docker-worker: 2016/10/16 00:16:59 Proxying /queue/v1/task/BkvO2TO9TQq9YTetMR_Fww | PUT | https://queue.taskcluster.net/v1/task/BkvO2TO9TQq9YTetMR_Fww#015 


[1] i-053eff9921b5d2d9e
[2] sha256:d02e6b303470ec72d4572f743a85f7f12b9d59a7d626796fdc851758d4721349
[3] i-087b82a0d94a4673e
I've noticed that the NSS decision task is implemented in javascript, using the JS taskcluster client, and that the gecko decision task is implemented in python, using the python taskcluster client.

I checked the apis.js file in the javascript taskcluster-client library, and noticed that for createTask, that the method is "put" rather than "PUT".

I wonder if the javascript client therefore is sending the method in lowercase, and the python client is sending in uppercase - and whether there was a change between go 1.5 (that the older taskcluster-proxy was compiled under) and go 1.7 (the the newer taskcluster-proxy was compiled under) that changed the handling of lowercase method names to default to GET, rather than to uppercase the method. I believe the HTTP spec requires an uppercase method.

This is all just a hunch to account for the fact gecko did not break, yet NSS did, and also that there was not a significant code change regarding the forwarding of the HTTP method - so wondering if this is a go 1.5 vs go 1.7 treatment change.

I'll do some testing to check this hypothesis.
Being a node novice, I'm hitting a compile problem while trying to reproduce locally.

I've pushed to github to demonstrate the compile problem in travis:
https://travis-ci.org/petemoore/pete-tutorial/builds/170073238#L164
Thanks to jhford for helping me fix that code.

Inside go, I am receiving the raw HTTP request as:


GET /queue/v1/task/654LvMHWRd-_Tv6YK9W_2g HTTP/1.1
Host: localhost:60024
Accept-Encoding: gzip, deflate
Authorization: Hawk id="mozilla-ldap/pmoore@mozilla.com/dev", ts="1477395283", nonce="tD53BI", mac="VPmmO4vA2Y2xber/vVbs2KPuSm/QIAUqiAwnM+BNFSE="
Connection: keep-alive
User-Agent: node-superagent/1.7.2


This is generated by:

https://github.com/petemoore/pete-tutorial/blob/40f299f1d3d5cdffb025010ea96ff0c648dbd788/src/test.js#L56


I have to pop out for a couple of hours now, will look at this when I'm back.

I'll see if I can capture on the node side of the exchange, what HTTP request it thinks it is generating.
Note: the "mac" above is a one time thing - so there should be no security concerns.
So if I run the same node script to generate a task, and listen with `nc`, I get what looks to be a correctly formed PUT request:

$ nc -l 60024 | cat
PUT /queue/v1//task/aXOnt0oAS4Sx2C3BPHabsQ HTTP/1.1
Host: localhost:60024
Accept-Encoding: gzip, deflate
user-agent: node-superagent/1.7.2
content-type: application/json
authorization: Hawk id="mozilla-ldap/pmoore@mozilla.com/dev", ts="1477409118", nonce="0vacAV", mac="cFxzRWZTBwlPSBczK3ekL11608myWEO8Djr6elMn9K4="
Content-Length: 812
Connection: keep-alive

so the big question ..... what is translating the PUT to a GET ?!
So it transpires that the following was taking place:

1) the baseUrl used in the queue client had a trailing slash: 

http://hg.mozilla.org/projects/nss/file/3e7b53b18112/automation/taskcluster/graph/src/queue.js#l18

This is why in comment 11, the PUT request has a double slash:
PUT /queue/v1//task/aXOnt0oAS4Sx2C3BPHabsQ HTTP/1.1

This resulted in the http handler including in the go standard library responding to the PUT request with a HTTP 301:

https://golang.org/pkg/net/http/#HandleFunc -> https://golang.org/pkg/net/http/#ServeMux.HandleFunc -> https://golang.org/pkg/path/#Clean (which cleans the double slashes)

This isn't currently explicitly documented in go 1.7.3, but I can see there is a doc fix already merged which presumably will reach a future go release:
https://github.com/golang/go/issues/13639

When the node-superagent client receives the 301 response, it then submits a GET request to the new location, dropping the http payload.

node-superagent declares conformance to the HTTP 1.1 specification in the PUT request. The HTTP 1.1 spec states in section 10.3.2[1]:

   If the 301 status code is received in response to a request other
   than GET or HEAD, the user agent MUST NOT automatically redirect the
   request unless it can be confirmed by the user, since this might
   change the conditions under which the request was issued.

Therefore, this looks like a non compliant implementation (i.e. a node-superagent bug).

It is also questionable whether http.ServeMux should issue 301 responses to non GET/HEAD http requests.

It looks like the change that introduced this behaviour in taskcluster-proxy was:
https://github.com/taskcluster/taskcluster-proxy/commit/6e8d8b5984ce04f1b7fd6fbfa9b904e7e1aa76f1#diff-7ddfb3e035b42cd70649cc33393fe32c

This change was to move from having a single giant handler for all endpoints, to having a separate handler per endpoint. As a side-effect, this introduced the "path cleaning" which, in combination with the node-superagent treatment of 301 responses, caused this bug.

I'll:

* submit a patch to drop the trailing slash of the queue baseUrl in NSS, as that will be cleaner, and allow NSS to work with the newer taskcluster-proxy docker image (which is orders of magnitude smaller than the previous release).
* raise my concern about the treatment of 301 responses from non- GET/HEAD requests in the node-superagent library (as a potential bug)
* raise the question with the go language team about whether a http 301 response is appropriate for non- GET/HEAD requests
* release a new taskcluster-proxy that does not use the http.ServeMux request multiplexer (at least not the default one, if the behaviour can be tweaked) so that no path cleaning takes place. Please note we have e.g. artifacts that contain double slashes, due to being created from a directory artifact containing a trailing slash - so e.g. a non-GET/HEAD request to such an artifact could be problematic (e.g. when publishing the artifact).


===
1) https://tools.ietf.org/html/rfc2616#section-10.3.2
Sample program to demonstrate behaviour of go library default request multiplexer:

package main

import (
        "bytes"
        "log"
        "net/http"
        "net/http/httputil"
)

func main() {
        http.HandleFunc("/", handle)
        go func() {
                err := http.ListenAndServe(":60024", nil) // set listen port
                if err != nil {
                        log.Fatal("ListenAndServe: ", err)
                }
        }()
        issuePut("http://localhost:60024/a//b")
        issuePut("http://localhost:60024/a/b")
}

func issuePut(url string) {
        req, err := http.NewRequest("PUT", url, bytes.NewBufferString("some content"))
        if err != nil {
                panic(err)
        }
        res, err := http.DefaultClient.Do(req)
        resRaw, err := httputil.DumpResponse(res, true)
        if err != nil {
                panic(err)
        }
        log.Println("Issuing PUT request to " + url + " ....\n\nHTTP response received:\n" + string(resRaw))
}

func handle(writer http.ResponseWriter, req *http.Request) {
        bytes, err := httputil.DumpRequest(req, true)
        if err != nil {
                panic(err)
        }
        writer.Write([]byte("HTTP request received:\n"))
        writer.Write(bytes)
}



Output when running it:

$ go run ~/x.go 
2016/10/25 20:36:29 Issuing PUT request to http://localhost:60024/a//b ....

HTTP response received:
HTTP/1.1 301 Moved Permanently
Content-Length: 0
Content-Type: text/plain; charset=utf-8
Date: Tue, 25 Oct 2016 18:36:29 GMT
Location: /a/b


2016/10/25 20:36:29 Issuing PUT request to http://localhost:60024/a/b ....

HTTP response received:
HTTP/1.1 200 OK
Content-Length: 154
Content-Type: text/plain; charset=utf-8
Date: Tue, 25 Oct 2016 18:36:29 GMT

HTTP request received:
PUT /a/b HTTP/1.1
Host: localhost:60024
Accept-Encoding: gzip
Content-Length: 12
User-Agent: Go-http-client/1.1

some content
For some reason, couldn't push this to try - but should be ok.

pmoore@Petes-iMac:~/hg/nss $ hg push -v -f -r . try
pushing to ssh://hg.mozilla.org/projects/nss-try
searching for changes
2 changesets found
uncompressed size of bundle content:
     731 (changelog)
     541 (manifests)
     206  automation/taskcluster/graph/src/queue.js
    2304  gtests/ssl_gtest/ssl_fuzz_unittest.cc
     578  lib/ssl/ssl3con.c
     246  lib/ssl/tls13con.c
abort: stream ended unexpectedly (got 0 bytes, expected 4)
remote: abort: could not lock working directory of /repo/hg/mozilla/projects/nss-try: Read-only file system
Attachment #8812447 - Flags: review?(ttaubert)
(In reply to Pete Moore [:pmoore][:pete] from comment #14)
> Created attachment 8812447 [details] [diff] [review]
> bug1310643_nss_v1.patch
> 
> For some reason, couldn't push this to try - but should be ok.
> 

Looks like it was an intermittent problem - I can push now:

  * https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=cee5a515c58cfc521bcaff475b262a5bfa8e348a
Attachment #8812447 - Flags: review?(ttaubert) → review+
Looks like I don't have authority to push... Could someone push this for me?

Many thanks!
Pete


======


pmoore@Petes-iMac:~/hg/nss $ hg push -r . ssh://hg.mozilla.org/projects/nss
pushing to ssh://hg.mozilla.org/projects/nss
searching for changes
abort: stream ended unexpectedly (got 0 bytes, expected 4)
remote: abort: could not lock working directory of /repo/hg/mozilla/projects/nss: Permission denied
Keywords: checkin-needed
Release version 1.6.1 of taskcluster-client which will strip trailing slashes from baseUrl.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Client Libraries → Services
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: