Closed Bug 1147977 Opened 5 years ago Closed 5 years ago

tc-vcs: Mitigate slow TCP streams when downloading large S3 caches (parallel requests)

Categories

(Taskcluster :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jonasfj, Assigned: jlal)

References

Details

Attachments

(1 file, 1 obsolete file)

tl;dr: Be smarter when we download large artifacts from S3.

I suspect that if the network is saturated when we start downloading a file from S3,
or some other weird thing happens, TCP congestion control (or S3) might play
tricks on us and keep the download speed low.

So when downloading large files we probably need to support restarting the
download if the download speed stays below a certain threshold for an extended
period of time. This is sketchy because another task may be eating up all the
bandwidth, so a slow download speed over an extended period may still be a valid condition.

We can also download large files in parallel and reassemble them after download.

All of this is done using the "Range" header when downloading S3 artifacts.
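The range-splitting idea can be sketched in shell; `chunk_ranges` is a hypothetical helper (names made up here, not from tc-vcs) that emits the byte ranges, each of which would become one parallel curl request via `-r`/`--range` (which sets the HTTP Range header):

```shell
# Hypothetical helper: split a byte count into contiguous ranges suitable
# for curl's -r/--range flag (which sets the HTTP Range header).
chunk_ranges() {
  size=$1
  parts=$2
  # Ceiling division so the last range is never larger than the others.
  step=$(( (size + parts - 1) / parts ))
  start=0
  while [ "$start" -lt "$size" ]; do
    end=$(( start + step - 1 ))
    [ "$end" -ge "$size" ] && end=$(( size - 1 ))
    echo "${start}-${end}"
    start=$(( end + 1 ))
  done
}

# Each emitted line becomes one parallel request, roughly:
#   curl -r "$range" -o "part.$i" "$url" &
# followed by `wait` and `cat part.* > dest` to reassemble.
chunk_ranges 1000 4
```

For a 1000-byte file split four ways this prints `0-249`, `250-499`, `500-749`, `750-999`.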

--
Note, we might need to look into best practices; I'm sure we're not the first
to hit issues with occasional slow downloads from S3.
For context see bug 1147867
The quick fix here is to use aria2c (which does the above) instead of curl... I can do this easily for the testers.
If aria2c is configured correctly with split, etc. that might be exactly what we are looking for.
We should probably add --timeout too... I suspect that's a timeout for the entire command, so we'll
probably want to keep top-level retries too.
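As a sketch only (the flag values below are illustrative, and this is not what ultimately landed, since curl was kept instead), an aria2c invocation covering split, a per-stream minimum speed, a timeout, and retries might look like:

```shell
# Sketch: values are illustrative, not the configuration that landed.
aria2c --split=8 --max-connection-per-server=8 \
       --lowest-speed-limit=500K \
       --timeout=600 --max-tries=5 --retry-wait=10 \
       --out="$dest" "$url"
```

`--lowest-speed-limit` makes aria2c abort (and, with `--max-tries`, retry) a connection that drops below the given speed, which is the behavior discussed above.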

Not sure how aria2c works, but we should have timeouts and retry after timeout, is my point :)
(I just had a quick look at the man page.)
Playing with the configs now; it looks pretty easy to get something better than curl, at least.
Assignee: nobody → jlal
Status: NEW → ASSIGNED
After trying aria2c I think this is the wrong option... I am going to reconfigure curl to be a bit smarter (since we retry curl downloads anyway, this should be fine).
/r/6179 - Bug 1147977 - Add additional timeouts and retries to curl downloads in tc-vcs r=jonasfj

Pull down this commit:

hg pull review -r b58c3363566d90d43b8cc4df080e778d04996b12
Attachment #8584092 - Flags: review?(jopsen)
Comment on attachment 8584092 [details]
MozReview Request: bz://1147977/lightsofapollo

https://reviewboard.mozilla.org/r/6177/#review5181

::: testing/docker/tester/tc-vcs-config.yml
(Diff revision 1)
> +  get: curl --connect-timeout 30 --speed-limit 500000 -L -o {{dest}} {{url}}

Looks good to me...
Being explicit about --speed-time would be nice.
The numbers are debatable; less than 0.5M/s is certainly bad. I would argue that we almost need 1M/s, but that could be too high and kill streams that would recover.
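For instance, with --speed-time spelled out (curl defaults it to 30 seconds when --speed-limit is given), the config line might read:

```yaml
get: curl --connect-timeout 30 --speed-limit 500000 --speed-time 30 -L -o {{dest}} {{url}}
```

This aborts the transfer if it averages below 500000 bytes/s for 30 consecutive seconds, letting the outer retry logic restart it.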

::: testing/docker/tester/tc-vcs-config.yml
(Diff revision 1)
> +  repoUrl: https://git.mozilla.org/external/google/gerrit/git-repo.git

Do you overwrite this in the commands?
Or do these tests only clone gerrit?

If it is a specific tester image it should probably be called something less generic than `tester`.
Attachment #8584092 - Flags: review?(jopsen)
Comment on attachment 8584092 [details]
MozReview Request: bz://1147977/lightsofapollo

https://reviewboard.mozilla.org/r/6177/#review5183

Ship It!
Attachment #8584092 - Flags: review+
ahh... I see... you overwrote the default config file from:
https://github.com/taskcluster/taskcluster-vcs/blob/master/default_config.yml

Why not just fix this in taskcluster-vcs and update the image with new version from npm?
Then we don't have to have a config file in tree that just configures internals of tc-vcs.
The reason is very simple: the numbers we want for CI might be very different from what is normal for a non-CI user (reproducibility is only possible if you use the whole docker container anyway...).

A good example: if someone in the Paris office ran this, (one) it would currently suck, and (two) I doubt they would get a connection that even reaches 500k/s.
Flags: needinfo?(jopsen)
@jlal,
That makes sense. But it's still hard to run these tests locally,
because the docker image contains the configuration.

The optimal solution would be an option as environment variable:
  TASKCLUSTER_VCS_MIN_BANDWIDTH = 500000
And if this is specified we provide -y and -Y to curl when downloading.
(with $inherit this wouldn't be a nuisance in-tree).
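A minimal sketch of that gate, assuming a hypothetical `build_curl_flags` helper (the function name and the 30-second window are assumptions, only the variable name comes from the comment above):

```shell
# Hypothetical sketch of the proposed env-var gate; only
# TASKCLUSTER_VCS_MIN_BANDWIDTH is from the proposal above.
build_curl_flags() {
  if [ -n "$TASKCLUSTER_VCS_MIN_BANDWIDTH" ]; then
    # curl: -Y/--speed-limit (bytes/sec), -y/--speed-time (seconds)
    echo "-Y $TASKCLUSTER_VCS_MIN_BANDWIDTH -y 30"
  fi
}

TASKCLUSTER_VCS_MIN_BANDWIDTH=500000 build_curl_flags
```

With the variable set to 500000 this prints `-Y 500000 -y 30`; unset, it prints nothing and curl runs with no minimum-speed enforcement.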

The optimal solution is probably still parallel downloads.
Say 100MB chunks with an 8 min timeout on each chunk.
The likelihood that all chunks are slow is very small, and would likely imply real network issues.
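Roughly, assuming a 200MB artifact (the offsets below are made up for illustration), the per-chunk shape might be:

```shell
# Sketch: each 100MB chunk capped at 8 minutes via curl's -m/--max-time;
# offsets assume a 200MB artifact.
curl -r 0-104857599         -m 480 -o part.0 "$url" &
curl -r 104857600-209715199 -m 480 -o part.1 "$url" &
wait
cat part.0 part.1 > "$dest"
```

A stalled chunk then only costs a retry of that chunk, not a restart of the whole download.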

Anyways, I'm okay with this solution for now, land it :)

Note, getting 0.5M/s to S3 is very likely possible anywhere in the world, to/from any AWS region.
I certainly had no issues with S3 access times when I lived in Denmark.
That said, public wifi, etc... we shouldn't make assumptions about developers' internet connection speeds.
Flags: needinfo?(jopsen)
https://hg.mozilla.org/mozilla-central/rev/2f37a253a534
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Attachment #8584092 - Attachment is obsolete: true
Attachment #8619880 - Flags: review+
Component: TaskCluster → General
Product: Testing → Taskcluster
Target Milestone: mozilla39 → mozilla41
Version: unspecified → Trunk
Resetting Version and Target Milestone that accidentally got changed...
Target Milestone: mozilla41 → ---
Version: Trunk → unspecified