Closed
Bug 1147977
Opened 10 years ago
Closed 10 years ago
tc-vcs: Mitigate slow TCP streams when downloading large S3 caches (parallel requests)
Categories
(Taskcluster :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jonasfj, Assigned: jlal)
References
Details
Attachments
(1 file, 1 obsolete file)
tl;dr: Be smarter when we download large artifacts from S3.
I suspect that if the network is saturated when we start downloading a file from S3,
or some other weird thing happens, TCP congestion control (or S3) might play
tricks on us and keep the download speed low.
So when downloading large files we probably need to support restarting the
download if the download speed stays below a certain threshold for an extended period
of time. This is sketchy because another task may be eating up all the bandwidth,
but a slow download speed over an extended period is probably still a valid condition to restart on.
We can also download large files in parallel chunks and reassemble them after download.
Both approaches can be implemented with the "Range" header when downloading S3 artifacts.
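For illustration, a single ranged request looks roughly like this (the URL, offsets, and filenames below are made up, not a real artifact):
# Fetch the first 100 MB of a large artifact with an HTTP Range request (values illustrative).
curl -L --fail -H "Range: bytes=0-104857599" \
  -o cache.tar.gz.part0 \
  https://example-bucket.s3.amazonaws.com/some-large-cache.tar.gz
# The next chunk would use "Range: bytes=104857600-209715199", and so on;
# afterwards the parts are concatenated back together: cat cache.tar.gz.part* > cache.tar.gz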
--
Note, we might want to look into best practices; I'm sure we're not the first
to hit issues with occasional slow downloads from S3.
For context see bug 1147867
Assignee
Comment 1•10 years ago
The quick fix here is to use aria2c (which does the above) instead of curl... I can do this easily for the testers.
Reporter
Comment 2•10 years ago
If aria2c is configured correctly with split, etc. that might be exactly what we are looking for.
We should probably add --timeout too... I suspect that's a timeout for the entire command, so we'll
probably want to keep top-level retries too.
Not sure how aria2c works, but we should have timeouts and retry after timeout, is my point :)
(I just had a quick look at the man file)
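For reference, something along these lines could work; the flags exist in aria2c, but the values below are just guesses on my part, not a tested setup (using the same {{dest}}/{{url}} placeholders as the tc-vcs get command):
# Rough sketch of an aria2c-based `get` command; all values are illustrative.
#   -x / -s              : up to 5 connections, split the download into 5 segments
#   --min-split-size     : don't split segments smaller than 20 MB
#   --lowest-speed-limit : drop a connection that falls below ~500 kB/s
#   --timeout/--max-tries: per-connection timeout plus retries
aria2c -x 5 -s 5 --min-split-size=20M --lowest-speed-limit=500K \
  --timeout=60 --max-tries=5 -o {{dest}} {{url}}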
Assignee
Comment 3•10 years ago
Playing with the configs now; it looks pretty easy to get something better than plain curl, at least.
Assignee
Updated•10 years ago
Assignee: nobody → jlal
Status: NEW → ASSIGNED
Assignee
Comment 4•10 years ago
After trying aria2c I think it is the wrong option... I am going to reconfigure curl to be a bit smarter instead (since we retry curl downloads anyway, this should be fine).
Assignee
Comment 5•10 years ago
/r/6179 - Bug 1147977 - Add additional timeouts and retries to curl downloads in tc-vcs r=jonasfj
Pull down this commit:
hg pull review -r b58c3363566d90d43b8cc4df080e778d04996b12
Attachment #8584092 -
Flags: review?(jopsen)
Reporter
Comment 6•10 years ago
Comment on attachment 8584092 [details]
MozReview Request: bz://1147977/lightsofapollo
https://reviewboard.mozilla.org/r/6177/#review5181
::: testing/docker/tester/tc-vcs-config.yml
(Diff revision 1)
> + get: curl --connect-timeout 30 --speed-limit 500000 -L -o {{dest}} {{url}}
Looks good to me...
Being explicit about --speed-time would be nice.
The numbers are debatable; less than 0.5 MB/s is certainly bad. I would argue that we almost need 1 MB/s, but that could be too high and kill streams that would otherwise recover.
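e.g. something like this (just a sketch of what I mean, the exact numbers are still up for debate):
# Sketch: make the measurement window explicit with --speed-time.
# Abort (and let the outer retry take over) if we average under ~500 kB/s for 30 seconds.
get: curl --connect-timeout 30 --speed-limit 500000 --speed-time 30 -L -o {{dest}} {{url}}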
::: testing/docker/tester/tc-vcs-config.yml
(Diff revision 1)
> + repoUrl: https://git.mozilla.org/external/google/gerrit/git-repo.git
Do you overwrite this in the commands?
Or do these tests only clone gerrit?
If it's a specific tester image, it should probably be called something less generic than `tester`.
Attachment #8584092 -
Flags: review?(jopsen)
Reporter
Comment 7•10 years ago
Comment on attachment 8584092 [details]
MozReview Request: bz://1147977/lightsofapollo
https://reviewboard.mozilla.org/r/6177/#review5183
Ship It!
Attachment #8584092 -
Flags: review+
Reporter
Comment 8•10 years ago
ahh... I see... you overwrote the default config file from:
https://github.com/taskcluster/taskcluster-vcs/blob/master/default_config.yml
Why not just fix this in taskcluster-vcs and update the image with the new version from npm?
Then we wouldn't need a config file in-tree that just configures internals of tc-vcs.
Assignee
Comment 9•10 years ago
The reason is very simple: the settings we want for CI might be very different from what is normal for a non-CI user (reproducibility is only possible if you use the whole docker container anyway...).
A good example: if someone in the Paris office ran this, for one it would currently suck, but I also doubt they would get a connection that reaches even 500 kB/s.
Flags: needinfo?(jopsen)
Reporter
Comment 10•10 years ago
@jlal,
That makes sense. But it's still hard to run these tests locally,
because the docker image contains the configuration.
The optimal solution would be an option exposed as an environment variable:
TASKCLUSTER_VCS_MIN_BANDWIDTH=500000
If this is specified we pass -y and -Y to curl when downloading.
(With $inherit this wouldn't be a nuisance in-tree.)
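Something along these lines (purely a sketch of the idea; TASKCLUSTER_VCS_MIN_BANDWIDTH is a proposed name, not an existing option, and DEST/URL stand in for whatever tc-vcs passes to the get command):
# Sketch: only add speed-based abort flags when the env var is set.
SPEED_ARGS=""
if [ -n "$TASKCLUSTER_VCS_MIN_BANDWIDTH" ]; then
  # -Y/--speed-limit is bytes/sec, -y/--speed-time is the measurement window in seconds
  SPEED_ARGS="-Y $TASKCLUSTER_VCS_MIN_BANDWIDTH -y 30"
fi
curl --connect-timeout 30 $SPEED_ARGS -L -o "$DEST" "$URL"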
The optimal solution is probably still parallel downloads:
say 100 MB chunks with an 8 min timeout on each chunk.
The likelihood that all chunks are slow is very small, and if they all are it likely implies real network issues.
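Roughly like this (a back-of-the-envelope sketch, not an implementation; names and the size probe are made up for illustration):
# Fetch 100 MB chunks in parallel, each with its own 8 min cap, then reassemble.
SIZE=$(curl -sIL "$URL" | tr -d '\r' | awk 'tolower($1)=="content-length:" {print $2}' | tail -1)
CHUNK=$((100 * 1024 * 1024))
for ((off=0; off<SIZE; off+=CHUNK)); do
  end=$((off + CHUNK - 1))
  curl -sL --max-time 480 -H "Range: bytes=$off-$end" -o "part.$off" "$URL" &
done
wait
# Concatenate the parts in byte order to rebuild the artifact.
cat $(ls part.* | sort -t. -k2 -n) > artifact && rm part.*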
Anyways, I'm okay with this solution for now, land it :)
Note, getting 0.5 MB/s from S3 is very likely achievable anywhere in the world, to/from any AWS region.
I certainly had no issues with S3 access times when I lived in Denmark.
That said, with public wifi, etc., we shouldn't make assumptions about developers' internet connection speeds.
Flags: needinfo?(jopsen)
Assignee
Comment 11•10 years ago
Comment 12•10 years ago
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
status-firefox39:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Assignee
Comment 13•10 years ago
Attachment #8584092 -
Attachment is obsolete: true
Attachment #8619880 -
Flags: review+
Assignee
Comment 14•10 years ago
Updated•10 years ago
status-firefox39:
fixed → ---
Component: TaskCluster → General
Product: Testing → Taskcluster
Target Milestone: mozilla39 → mozilla41
Version: unspecified → Trunk
Comment 15•10 years ago
Resetting Version and Target Milestone that accidentally got changed...
Target Milestone: mozilla41 → ---
Version: Trunk → unspecified