Intermittent [taskcluster:error] Task timeout after 14400 seconds. Force killing container.

RESOLVED WONTFIX

Status

RESOLVED WONTFIX

People

(Reporter: RyanVM, Unassigned)

Tracking

({intermittent-failure})

Details

Comment hidden (empty)
Comment hidden (Treeherder Robot)
Summary: Intermittent [taskcluster] Error: Task timeout after 14400 seconds. Force killing container. → Intermittent [taskcluster:error] Task timeout after 14400 seconds. Force killing container.
Duplicate of this bug: 1204235
Comment hidden (Treeherder Robot)
Comment hidden (Intermittent Failures Robot)
This is the #6 intermittent orange over the past 3 days.  It would be helpful to get this assigned appropriately (whether that's a fix in Taskcluster or better diagnostics that would lead to a fix elsewhere).
Flags: needinfo?(sdeckelmann)
Comment hidden (Intermittent Failures Robot)
ack, these failures are all git-sync related; some examples:
-------------------------------------------------------------------------------
 Fetching project android-development
 Fetching project platform/external/liblzf
 Fetching project platform/external/iproute2
 Fetching project platform/abi/cpp
 Fetching project platform/prebuilts/gcc/linux-x86/host/x86_64-linux-glibc2.7-4.6
 [taskcluster-vcs:warning] run end (with error) try (10/20) retrying in 11629572.478532791 ms : ./repo sync -j1
 [taskcluster:error] Task timeout after 14400 seconds. Force killing container.
-------------------------------------------------------------------------------
^ this is the most common one
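The 11629572.478532791 ms retry delay above is roughly 3.2 hours, so a single late retry can outlast most of the 14400 s task timeout on its own, which suggests the backoff grows without a cap. A minimal sketch of a capped exponential backoff follows; the base and cap values are illustrative assumptions, not tc-vcs's actual parameters:

```shell
# Hypothetical capped backoff: delay doubles per try but never exceeds a
# cap, so 20 retries stay within a bounded total wait. Values are assumed.
backoff_ms() {
  try=$1
  base=1000          # assumed base delay (ms)
  cap=60000          # assumed cap: never wait more than 60 s per try
  delay=$(( base * (1 << (try - 1)) ))
  [ "$delay" -gt "$cap" ] && delay=$cap
  echo "$delay"
}

backoff_ms 1    # prints 1000
backoff_ms 10   # uncapped this would be 512000; the cap holds it at 60000
```

With a cap like this, even try 20 waits at most 60 s instead of hours.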


-------------------------------------------------------------------------------
 : export GIT_DIR=/home/worker/workspace/B2G/.repo/projects/external/icu4c.git
 : git fetch caf --tags +refs/heads/*:refs/remotes/caf/* 1>| 2>|
 : load refs /home/worker/workspace/B2G/.repo/projects/external/icu4c.git
 : scan refs /home/worker/workspace/B2G/.repo/projects/external/icu4c.git
 : parsing /home/worker/workspace/B2G/.repo/projects/external/aac.git/config
 : export GIT_DIR=/home/worker/workspace/B2G/.repo/projects/external/aac.git
 : git fetch caf --tags +refs/heads/*:refs/remotes/caf/* 1>| 2>|
 [taskcluster:error] Task timeout after 14400 seconds. Force killing container.
-------------------------------------------------------------------------------
^ this is not as common, but seen many times


-------------------------------------------------------------------------------
  * [new tag]         android-6.0.0_r7 -> android-6.0.0_r7
  * [new tag]         android-6.0.1_r1 -> android-6.0.1_r1
  * [new tag]         android-6.0.1_r3 -> android-6.0.1_r3
  * [new tag]         android-cts-6.0_r2 -> android-cts-6.0_r2
 : load refs /home/worker/workspace/B2G/.repo/projects/external/expat.git
 : scan refs /home/worker/workspace/B2G/.repo/projects/external/expat.git
 : git update-ref -m manifest set to 4a65b6cb6d2aae6bf952b0e6b44881f84fdf5128 --no-deref refs/remotes/m/master 4a65b6cb6d2aae6bf952b0e6b44881f84fdf5128^0 1>| 2>|
 : parsing /home/worker/workspace/B2G/.repo/projects/external/strace.git/config
 : export GIT_DIR=/home/worker/workspace/B2G/.repo/projects/external/strace.git
 : git fetch caf --tags +refs/heads/*:refs/remotes/caf/* 1>| 2>|
 [taskcluster:error] Task timeout after 14400 seconds. Force killing container.
-------------------------------------------------------------------------------
^ this is seen many times as well


:gps, you do a lot of vcs work (or at least blog about it a lot); can you help find the right person to look into why our git fetches are taking so long?
Flags: needinfo?(gps)
Git + repo have historically been a cluster$*#@.

One reason things frequently time out is because we're fetching an insane amount of data and repositories. There's a lot of surface area for things to go wrong.

On top of that, repo can be brain dead about retries. My understanding is it will aggressively purge old repos and resync from scratch. This has led to traffic floods against git.mo in the past, which makes clones fail, which makes jobs fail.
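That flood pattern is a thundering-herd problem: many machines purge and re-clone at roughly the same moment. A standard mitigation, sketched here purely as a suggestion (not something tc-vcs or repo is known to do), is to add random jitter to each machine's retry delay so the re-clones don't all hit git.mo at once:

```shell
# Hypothetical jittered retry delay (bash): each machine waits between
# base and 2*base seconds instead of exactly base seconds, spreading
# simultaneous re-clones over a window rather than a single instant.
jittered_delay() {
  base=$1
  jitter=$(( RANDOM % (base + 1) ))   # 0..base extra seconds
  echo $(( base + jitter ))
}

# usage (illustrative): sleep "$(jittered_delay 30)" before retrying repo sync
```

Even a small jitter window meaningfully flattens the load spike when dozens of workers retry together.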

hwine, fubar, and/or catlee should be able to add more context, as they have all touched the git/repo side of things more than me.
Flags: needinfo?(gps)
Comment hidden (Intermittent Failures Robot)
(In reply to Joel Maher (:jmaher) from comment #94)
> Fetching project platform/prebuilts/gcc/linux-x86/host/x86_64-linux-glibc2.7-4.6
> [taskcluster-vcs:warning] run end (with error) try (10/20) retrying in 11629572.478532791 ms : ./repo sync -j1
> [taskcluster:error] Task timeout after 14400 seconds. Force killing container.

offhand, the only odd thing I see is that i686-linux-glibc2.7-4.6.git, aac.git, and strace.git are missing the remote default branch:

Cloning into 'i686-linux-glibc2.7-4.6'...
[...].
warning: remote HEAD refers to nonexistent ref, unable to checkout.

also:
git1.dmz.scl3# pwd
/var/lib/gitolite3/repositories/external/sprd-aosp/platform/external/strace.git
git1.dmz.scl3# git fsck
notice: HEAD points to an unborn branch (master)
Checking object directories: 100% (256/256), done.
Checking objects: 100% (343/343), done.

but that's probably a red herring? 

the other thing to look at is to make sure that TC is actually caching those repos.
we need to figure out a way to fail faster here; being stuck on vcs cloning for 3+ hours sounds horrible.
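One cheap way to fail faster, sketched here as a suggestion rather than anything the task currently does, is to give the VCS step its own time budget instead of letting it ride out the full 14400 s task timeout, e.g. with coreutils `timeout` (the 1800 s budget is an assumed value):

```shell
# Hypothetical per-step budget: kill the sync well before the 4-hour task
# timeout so the failure surfaces (and can be retried) much sooner.
run_with_budget() {
  budget=$1; shift
  timeout "$budget" "$@" || {
    echo "[vcs] '$*' exceeded ${budget}s budget, failing fast" >&2
    return 1
  }
}

# usage (illustrative): run_with_budget 1800 ./repo sync -j1
```

The task then fails in ~30 minutes with a clear "exceeded budget" message instead of a generic 4-hour force-kill.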

Either way, who should own this? If gps says git+repo has a bad history, why not use a better vcs? I know that isn't a conversation for this bug, but our top failures are git-related and are wasting every single developer's time, even developers who don't use git.

I know this is the holidays, but we shouldn't have to wait for Selena to get back in order to assign this to someone to fix this issue, unless Selena is the one who owns/wrote/maintains that part of the code.

catlee, can you find an owner and get this fixed?  (it appears hwine is on pto this week as well)
Flags: needinfo?(catlee)
Comment hidden (Intermittent Failures Robot)
I'm not familiar with how tc-vcs handles this. For buildbot, we've done some work to optimize git+repo usage, but I'm not sure if tc-vcs is doing the same thing.

Greg / Jonas / John - any suggestions here?
Flags: needinfo?(jopsen)
Flags: needinfo?(jhford)
Flags: needinfo?(garndt)
Flags: needinfo?(catlee)
The last time I looked into this I hit a dead end and came to the same conclusion that for some reason 'repo' was doing full clones of all the projects, even if there was a local copy available.  It wasn't clear to me at the time why it was being forced to do such a thing.

I spot checked a few of the recently starred tasks; some already had a locally cached copy that tc-vcs pulled for a previous task, and some had to do a fresh checkout. Both types of tasks ended the same way: timing out while running 'repo sync -j1'.

What were the things done to optimize git+repo on the buildbot side? Perhaps there are things we should be doing as well that could help.
Flags: needinfo?(garndt)
The code for this is here:
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/…

I don't think we run with -j1, so that may be a quick optimization.
We symlink the local working directory's .repo dir to a machine-local cache.
In the case of some of these tasks that timed out, .repo is already populated with a local directory cache from a previous task.  We attempt to do that as well after a task has already pulled it down the first time. 
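The symlink trick catlee describes can be sketched like this; the paths and function name are illustrative assumptions, not the actual mozharness or tc-vcs code:

```shell
# Hypothetical machine-local .repo cache: the first task on a machine pays
# the full clone cost; later tasks reuse the cached git objects via a
# symlink from the workspace's .repo to the persistent cache directory.
link_repo_cache() {
  cache=$1; workspace=$2
  mkdir -p "$cache" "$workspace"
  # only create the link if .repo doesn't already exist (idempotent)
  [ -e "$workspace/.repo" ] || ln -s "$cache" "$workspace/.repo"
}

# usage (illustrative paths):
# link_repo_cache /builds/git-shared/repo-cache /home/worker/workspace/B2G
```

The check before `ln -s` keeps repeated tasks on the same machine from failing when the link is already in place.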

re: -j1 
We originally had this at -j100 which was way too high and we went in the other drastic direction of making it -j1 because of the problems it was causing with git.m.o repeatedly. When repo gets into a state of wanting to do full clones, which happens on multiple machines around the same time, it creates quite the situation with git.m.o.  Since changing this to -j1 I have not had a report of taskcluster tasks causing major disruptions of git.m.o so I'm a little scared to change that, but if there is a more sensible option I'm always up for that. :)
Flags: needinfo?(jopsen)
Comment hidden (Intermittent Failures Robot)
Given that these primarily affect b2g emulator builds, I'm closing this bug. 

We have work underway to replace tc-vcs that will probably also fix issues involving repo, but this is no longer something our team will invest time in fixing.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(jhford)
Resolution: --- → WONTFIX