git+http doesn't appear to honor keep alive settings with Centos 6

RESOLVED FIXED

Status

Release Engineering
General Automation
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: hwine, Assigned: wcosta)

Tracking

(Depends on: 1 bug)

unspecified

Firefox Tracking Flags

(firefox42 fixed)

Details

Attachments

(2 attachments)

(Reporter)

Description

2 years ago
tl;dr: timeouts fetching from git.mozilla.org may be due to lack of HTTP keep alive support in libcurl

Over the last few days, there have been numerous reports of timeouts in TC jobs interacting with git.mozilla.org.

Due to excellent detective work by a combined TC, MOC, Dev Services & Releng Crew, the following events were noticed:
 - TC builder client had a "hung" git fetch for /external/caf/platform/external/libpng.git
 - TC builder client had a TCP socket in CLOSE_WAIT state
 - git1.dmz.scl3 did not have an associated connection
 - git1.dmz.scl3 did have some "client disconnect" messages, but unclear if related
 - zlb VIP did not have an associated connection
 - zlb does not log connection terminations

The socket in CLOSE_WAIT state triggered a check of the keepalive configuration. Neither client nor server overrides the default setting of 5 seconds for git protocol connections.

However, while researching if there was a configuration setting for keepalive on git+HTTP, the following article http://git.661346.n2.nabble.com/PATCH-http-enable-keepalive-on-TCP-sockets-td7597589.html suggested that git+HTTP keepalives were only supported with libcurl version 7.25 and later.

Investigation of the TC builder client showed it is using CentOS 6, which has version 7.19.7 of libcurl.
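
To make the version requirement concrete, here is a minimal sketch of a check that a builder image could run against whatever version string its libcurl reports. The helper names (`parse_version`, `supports_tcp_keepalive`) are hypothetical, and the 7.25.0 threshold is taken from the article linked above, where the git patch relies on libcurl's TCP keepalive option.

```python
import re

# Threshold from the linked patch discussion: TCP keepalive support in
# libcurl is reported to start with 7.25.0.
MIN_KEEPALIVE_VERSION = (7, 25, 0)


def parse_version(text):
    """Extract the first (major, minor, patch) triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.groups())


def supports_tcp_keepalive(version_text):
    """Return True if the reported libcurl version is new enough."""
    return parse_version(version_text) >= MIN_KEEPALIVE_VERSION
```

The comparison uses integer tuples rather than strings, so a version such as 7.9.0 is not mistakenly treated as newer than 7.25.0.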

There are several options from here - that's what this bug is to coordinate.
Duplicate of this bug: 1177322
Awesome detective work guys, and nice summary!
Do we understand why this only started to affect us so severely within the last week?
(Reporter)

Comment 4

2 years ago
Moving bug: this is also occurring in Buildbot jobs, which also use CentOS 6 builders.
Component: TaskCluster → General Automation
Product: Testing → Release Engineering
QA Contact: catlee
(Reporter)

Comment 5

2 years ago
Created attachment 8626311 [details] [diff] [review]
timeout.patch

WORKAROUND: change timeout to allow quicker fails while rest of problem investigated.

"10 min" picked as 33% higher than average time.
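
For illustration only, the fail-fast idea can be sketched as below. This is not the attached mozharness patch; `run_with_timeout` is a hypothetical helper, and the 600-second value simply mirrors the "10 min" figure above.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it ran longer than timeout_s."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising on timeout.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in command; a real job would pass its fetch command here.
    print(run_with_timeout([sys.executable, "-c", "pass"], 600))
```

Note that only the direct child process is killed on timeout. A tool that spawns its own workers may need process-group handling on top of this.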
Attachment #8626311 - Flags: review?(catlee)
(Reporter)

Comment 6

2 years ago
Comment on attachment 8626311 [details] [diff] [review]
timeout.patch

r+ from :catlee IRL (yay WW)
Attachment #8626311 - Flags: review?(catlee) → review+
(Reporter)

Comment 7

2 years ago
Comment on attachment 8626311 [details] [diff] [review]
timeout.patch

https://hg.mozilla.org/build/mozharness/rev/f5d11e85e980
Attachment #8626311 - Flags: checked-in+
Failure logs with the newer 600s timeout:
https://treeherder.mozilla.org/logviewer.html#?job_id=11126605&repo=mozilla-inbound
(Reporter)

Comment 11

2 years ago
Next step is to get someone from b2g build team to debug the "repo tool" output and/or add debugging output to it.

Until we know what specific command is failing, and how, we're stuck.

ni: mwu for help and/or a reference
Flags: needinfo?(mwu)
This is arguably a tree-closing issue (or grounds for hiding most B2G emulator/device image builds). I'm not sure who's around right now, but a failure rate of roughly 50% or more isn't acceptable and needs attention from *someone* ASAP.
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(jonas)
Flags: needinfo?(jocheng)
Flags: needinfo?(faramarz)
Flags: needinfo?(fabrice)

Comment 27

2 years ago
No references here. We don't usually mess with git and repo. Sounds like something changed on the automation side and needs to be backed out. Alternately, you can experiment with upgrading git and/or libcurl.
Flags: needinfo?(mwu)
:wcosta is working right now on upgrading libcurl to stop burning builds. The longer-term fix here is probably getting off CentOS6.
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(jonas)
Flags: needinfo?(fabrice)
(Assignee)

Updated

2 years ago
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
(Assignee)

Comment 29

2 years ago
Created attachment 8627515 [details]
MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie

Bug 1177190: Update libcurl in docker images. r=selena

libcurl shipped with CentOS 6 doesn't support keepalive. This is causing
builds to burn.
(Assignee)

Updated

2 years ago
Attachment #8627515 - Flags: review?(sdeckelmann)
(Assignee)

Comment 30

2 years ago
Comment on attachment 8627515 [details]
MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie

Bug 1177190: Update libcurl in docker images. r=selena

libcurl shipped with CentOS 6 doesn't support keepalive. This is causing
builds to burn.
Attachment #8627515 - Flags: review?(sdeckelmann) → review+
Comment on attachment 8627515 [details]
MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie

https://reviewboard.mozilla.org/r/12255/#review10733

Ship It!
(Assignee)

Comment 32

2 years ago
https://hg.mozilla.org/integration/b2g-inbound/rev/bba8e8d63c37
(In reply to Wander Lairson Costa [:wcosta] from comment #32)
> https://hg.mozilla.org/integration/b2g-inbound/rev/bba8e8d63c37

sorry had to back this out for perma failures like https://treeherder.mozilla.org/logviewer.html#?job_id=2225232&repo=b2g-inbound

Comment 34

2 years ago
Backout:
https://hg.mozilla.org/integration/b2g-inbound/rev/a16f198045ae

Updated

2 years ago
Flags: needinfo?(wcosta)
(Assignee)

Updated

2 years ago
Depends on: 1178899
(Assignee)

Comment 35

2 years ago
Emulator bustage was caused by Bug 1178899. It should be fixed now; I was able to run a successful build:
https://tools.taskcluster.net/task-inspector/#a8ecyIG3QEuGN95D7BOXUg/2

Can we back out the backout?
Flags: needinfo?(wcosta) → needinfo?(cbook)
(Assignee)

Comment 36

2 years ago
Just talked to selena; we are going to do that another way.
Flags: needinfo?(cbook)
(Assignee)

Updated

2 years ago
Depends on: 1178997
(Assignee)

Comment 37

2 years ago
Comment on attachment 8627515 [details]
MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie

Bug 1177190: Update libcurl in docker images. r=selenamarie

libcurl on CentOS 6 doesn't support keepalive, so we upgrade it.
The approach we take to avoid breaking buildbot machines is to
grab libcurl from CentOS 7, build it on CentOS 6 and upload rpms
to S3.
Attachment #8627515 - Attachment description: MozReview Request: Bug 1177190: Update libcurl in docker images. r=selena → MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie
Attachment #8627515 - Flags: review+ → review?(sdeckelmann)
(Assignee)

Updated

2 years ago
Attachment #8627515 - Flags: review?(sdeckelmann) → review?(dustin)
(Assignee)

Comment 38

2 years ago
Comment on attachment 8627515 [details]
MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie

Bug 1177190: Update libcurl in docker images. r=selenamarie

libcurl on CentOS 6 doesn't support keepalive, so we upgrade it.
The approach we take to avoid breaking buildbot machines is to
grab libcurl from CentOS 7, build it on CentOS 6 and upload rpms
to S3.
(Assignee)

Comment 39

2 years ago
https://hg.mozilla.org/integration/b2g-inbound/rev/4bfe1c223646
https://reviewboard.mozilla.org/r/12253/#review10881

::: testing/docker/b2g-build/Dockerfile:18
(Diff revision 2)
> +  cd -

You should be able to just 'yum install $url' which avoids loading the RPMs onto disk
Comment on attachment 8627515 [details]
MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie

It'd be good to have some comments in there regarding why these aren't installed from a yum repo, too.

FWIW, there's another option to enforce keepalive for everything:
  http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#libkeepalive
Why the Linux kernel doesn't do this by default, I don't know.  Would the Internet collapse from an extra TCP round trip every 5 minutes?  The number of TCP connections that last that long is a vanishingly small portion of all TCP connections.  But I digress.
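
As a rough sketch of what enforcing keepalive amounts to for a single socket, the snippet below sets the standard socket options by hand. The function name `enable_keepalive` and the timing values are assumptions for illustration (the 300-second idle time echoes the "every 5 minutes" remark above); the tuning options beyond `SO_KEEPALIVE` are platform-specific, so they are applied only where Python exposes them.

```python
import socket


def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Turn on TCP keepalive for sock and tune it where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants are not defined on every platform, so look them up
    # defensively instead of referencing them directly.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

With probes enabled, a peer that has silently gone away is eventually detected by the kernel and the blocked read fails with an error, rather than waiting indefinitely.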
Attachment #8627515 - Flags: review?(dustin) → review+
https://hg.mozilla.org/mozilla-central/rev/4bfe1c223646
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox42: --- → fixed
Resolution: --- → FIXED

Updated

2 years ago
Flags: needinfo?(jocheng)
Flags: needinfo?(faramarz)