Closed Bug 1133876 Opened 5 years ago Closed 5 years ago

python-client: Ensure that retries works with taskcluster.utils.putFile

Categories

(Taskcluster :: General, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jonasfj, Unassigned)

References

Details

The requests library docs is not clear on whether or not it seeks a file to 0,
before streaming it's body. From looking at the requests code I suspect it doesn't seek to zero.

As a result taskcluster.utils.putFile, will probably fail to put the
entire file if the first retry files. And since S3 requires content-length
to be correct, subsequent retries will also fail. So we get no file.

We should fix this, either in `makeSingleHttpRequest` by seeking to zero, if
the payload has a seek method. Or by reimplementing retry logic in
taskcluster.utils.putFile.

See:
https://github.com/taskcluster/taskcluster-client.py/blob/1b1ec3b2c3fe5d422bbc7cd9cb51914b16d2a561/taskcluster/utils.py#L140-L148


Note, requests might seek to zero, from the code I read I strongly suspect it only does so for multi-part uploads. Not simple PUT requests.
Either way, we should test that this works! And fix this.
It's extremely likely that putFile retry logic is broken.
We really should test retry logic in our test suite. Both with 500 and connection errors.
If you run a localhost server, it's easy to make a method that fails every second time, don't know if httmock will let you do that.

Anyways, it's very important to test retry logic, because it's only used very rarely. And debugging these cases is very hard. So we should test it for both DNS errors, connection errors, 5xx errors, etc.
I tested this locally by temporarily redirecting taskcluster-artifacts.s3-us-west-2.amazonaws.com to 127.0.0.1 to trigger bug 1133856, and then remove that line from /etc/hosts so that one of the retries succeeds. In this case, I see a 200 response for the first retry, and then 400 responses for the next several retries.

I added this to makeHttpRequest:

      if hasattr(payload, 'seek'):
        payload.seek(0)
      response = makeSingleHttpRequest(...)

And then I saw a string of 200's. So it seems jonasfj's intuition is correct and we need to add the seek on retries.

However, even in the case where subsequent puts got a 400 response, the artifact that was attached to the task was still good (ie: it had the full text in the file that I uploaded - it wasn't subsequently overwritten with nothing). So I don't think this fully explains the behavior I saw in task g-8Wo0oST5iRYfdGapGlrg where the artifact exists in the task but is empty, and returns a 404 when trying to download it.
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #1)
> We really should test retry logic in our test suite. Both with 500 and
> connection errors.
> If you run a localhost server, it's easy to make a method that fails every
> second time, don't know if httmock will let you do that.
> 
> Anyways, it's very important to test retry logic, because it's only used
> very rarely. And debugging these cases is very hard. So we should test it
> for both DNS errors, connection errors, 5xx errors, etc.

Agreed, though I don't know how to add these test cases into taskcluster-client.py's current architecture.
@jhford,
Neither mshal or I are familiar with httmock, and/or the python testing framework.
If you have time please drop-in and save the day :)
Flags: needinfo?(jhford)
Blocks: 1117960
I've a fix for the issue here in a patch for bug 1134369.
I think it covers most, if not all the issues.
I think/hope this is fixed with bug 1134369
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Flags: needinfo?(jhford)
Component: TaskCluster → General
Product: Testing → Taskcluster
Target Milestone: --- → mozilla41
Version: unspecified → Trunk
Resetting Version and Target Milestone that accidentally got changed...
Target Milestone: mozilla41 → ---
Version: Trunk → unspecified
You need to log in before you can comment on or make changes to this bug.