Closed Bug 1147977 Opened 5 years ago Closed 5 years ago

tc-vcs: Mitigate slow TCP streams when downloading large S3 caches (parallel requests)

Categories

(Taskcluster :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jonasfj, Assigned: jlal)

References

Details

Attachments

(1 file, 1 obsolete file)

tl;dr: Be smarter when we download large artifacts from S3.

I suspect that if the network is saturated when we start downloading a file from S3,
or some other weird thing happens, TCP congestion control (or S3) might play
tricks on us and keep the download speed low.

So when downloading large files we probably need to support restarting the
download if the download speed stays below a certain threshold for an extended
period of time. This is sketchy because another task may be eating up all the
bandwidth, so a slow download speed over an extended period may still be a valid condition.

We can also download large files in parallel and reassemble them after download.

All of this is done using the "Range" header when downloading S3 artifacts.
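The range-splitting idea can be sketched in shell; `chunk_ranges` is a hypothetical helper (names made up here, not from tc-vcs) that emits the byte ranges, each of which would become one parallel curl request via `-r`/`--range` (which sets the HTTP Range header):

```shell
# Hypothetical helper: split a byte count into contiguous ranges suitable
# for curl's -r/--range flag (which sets the HTTP Range header).
chunk_ranges() {
  size=$1
  parts=$2
  # Ceiling division so the last range is never larger than the others.
  step=$(( (size + parts - 1) / parts ))
  start=0
  while [ "$start" -lt "$size" ]; do
    end=$(( start + step - 1 ))
    [ "$end" -ge "$size" ] && end=$(( size - 1 ))
    echo "${start}-${end}"
    start=$(( end + 1 ))
  done
}

# Each emitted line becomes one parallel request, roughly:
#   curl -r "$range" -o "part.$i" "$url" &
# followed by `wait` and `cat part.* > dest` to reassemble.
chunk_ranges 1000 4
```

For a 1000-byte file split four ways this prints `0-249`, `250-499`, `500-749`, `750-999`.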

--
Note, we might need to look into best practices; I'm sure we're not the first
to hit issues with occasional slow downloads from S3.
For context see bug 1147867
The quick fix here is to use aria2c (which does the above) instead of curl... I can do this easily for the testers.
If aria2c is configured correctly with split, etc. that might be exactly what we are looking for.
We should probably add --timeout too... I suspect that's a timeout for the entire command, so we'll
probably want to keep top-level retries too.
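As a sketch only (the flag values below are illustrative, and this is not what ultimately landed, since curl was kept instead), an aria2c invocation covering split, a per-stream minimum speed, a timeout, and retries might look like:

```shell
# Sketch: values are illustrative, not the configuration that landed.
aria2c --split=8 --max-connection-per-server=8 \
       --lowest-speed-limit=500K \
       --timeout=600 --max-tries=5 --retry-wait=10 \
       --out="$dest" "$url"
```

`--lowest-speed-limit` makes aria2c abort (and, with `--max-tries`, retry) a connection that drops below the given speed, which is the behavior discussed above.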

Not sure how aria2c works, but we should have timeouts and retry after timeout, is my point :)
(I just had a quick look at the man page.)
Playing with the configs now; it looks pretty easy to get something better than curl, at least.
Assignee: nobody → jlal
Status: NEW → ASSIGNED
After trying aria2c I think this is the wrong option... I am going to reconfigure curl to be a bit smarter (since we retry curl downloads anyway, this should be fine).
/r/6179 - Bug 1147977 - Add additional timeouts and retries to curl downloads in tc-vcs r=jonasfj

Pull down this commit:

hg pull review -r b58c3363566d90d43b8cc4df080e778d04996b12
Attachment #8584092 - Flags: review?(jopsen)
Comment on attachment 8584092 [details]
MozReview Request: bz://1147977/lightsofapollo

https://reviewboard.mozilla.org/r/6177/#review5181

::: testing/docker/tester/tc-vcs-config.yml
(Diff revision 1)
> +  get: curl --connect-timeout 30 --speed-limit 500000 -L -o {{dest}} {{url}}

Looks good to me...
Being explicit about --speed-time would be nice.
The numbers are debatable; less than 0.5M/s is certainly bad. I would argue that we almost need 1M/s, but that could be too high and kill streams that would recover.
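For instance, with --speed-time spelled out (curl defaults it to 30 seconds when --speed-limit is given), the config line might read:

```yaml
get: curl --connect-timeout 30 --speed-limit 500000 --speed-time 30 -L -o {{dest}} {{url}}
```

This aborts the transfer if it averages below 500000 bytes/s for 30 consecutive seconds, letting the outer retry logic restart it.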

::: testing/docker/tester/tc-vcs-config.yml
(Diff revision 1)
> +  repoUrl: https://git.mozilla.org/external/google/gerrit/git-repo.git

Do you overwrite this in the commands?
Or do these tests only clone gerrit?

If it is a specific tester image it should probably be called something less generic than `tester`.
Attachment #8584092 - Flags: review?(jopsen)
Comment on attachment 8584092 [details]
MozReview Request: bz://1147977/lightsofapollo

https://reviewboard.mozilla.org/r/6177/#review5183

Ship It!
Attachment #8584092 - Flags: review+
ahh... I see... you overwrote the default config file from:
https://github.com/taskcluster/taskcluster-vcs/blob/master/default_config.yml

Why not just fix this in taskcluster-vcs and update the image with new version from npm?
Then we don't have to have a config file in tree that just configures internals of tc-vcs.
The reason is very simple: the numbers we want for CI might be very different from what is normal for a non-CI user (reproducibility is only possible if you use the whole docker container anyway...).

A good example: if someone in the Paris office ran this, (one) it would currently suck, and (two) I doubt they would get a connection that even reaches 500k/s.
Flags: needinfo?(jopsen)
@jlal,
That makes sense. But it's still hard to run these tests locally,
because the docker image contains the configuration.

The optimal solution would be an option as environment variable:
  TASKCLUSTER_VCS_MIN_BANDWIDTH = 500000
And if this is specified we provide -y and -Y to curl when downloading.
(with $inherit this wouldn't be a nuisance in-tree).
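A minimal sketch of that gate, assuming a hypothetical `build_curl_flags` helper (the function name and the 30-second window are assumptions, only the variable name comes from the comment above):

```shell
# Hypothetical sketch of the proposed env-var gate; only
# TASKCLUSTER_VCS_MIN_BANDWIDTH is from the proposal above.
build_curl_flags() {
  if [ -n "$TASKCLUSTER_VCS_MIN_BANDWIDTH" ]; then
    # curl: -Y/--speed-limit (bytes/sec), -y/--speed-time (seconds)
    echo "-Y $TASKCLUSTER_VCS_MIN_BANDWIDTH -y 30"
  fi
}

TASKCLUSTER_VCS_MIN_BANDWIDTH=500000 build_curl_flags
```

With the variable set to 500000 this prints `-Y 500000 -y 30`; unset, it prints nothing and curl runs with no minimum-speed enforcement.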

The optimal solution is probably still parallel downloads.
Say 100MB chunks with an 8 min timeout on each chunk.
The likelihood that all chunks are slow is very small, and would likely imply real network issues.
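Roughly, assuming a 200MB artifact (the offsets below are made up for illustration), the per-chunk shape might be:

```shell
# Sketch: each 100MB chunk capped at 8 minutes via curl's -m/--max-time;
# offsets assume a 200MB artifact.
curl -r 0-104857599         -m 480 -o part.0 "$url" &
curl -r 104857600-209715199 -m 480 -o part.1 "$url" &
wait
cat part.0 part.1 > "$dest"
```

A stalled chunk then only costs a retry of that chunk, not a restart of the whole download.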

Anyways, I'm okay with this solution for now, land it :)

Note, getting 0.5M/s to S3 is very likely possible anywhere in the world, to/from any AWS region.
I certainly had no issues with S3 access times when I lived in Denmark.
That said, public wifi, etc... we shouldn't make assumptions about developers' internet connection speeds.
Flags: needinfo?(jopsen)
https://hg.mozilla.org/mozilla-central/rev/2f37a253a534
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Attachment #8584092 - Attachment is obsolete: true
Attachment #8619880 - Flags: review+
Component: TaskCluster → General
Product: Testing → Taskcluster
Target Milestone: mozilla39 → mozilla41
Version: unspecified → Trunk
Resetting Version and Target Milestone that accidentally got changed...
Target Milestone: mozilla41 → ---
Version: Trunk → unspecified