Closed Bug 960814 Opened 9 years ago Closed 8 years ago

Sometimes gaia integration tests timeout while fetching code from hg

Categories

(Testing Graveyard :: JSMarionette, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gaye, Assigned: jgriffin)

Attachments

(1 file)

Flags: needinfo?(jgriffin)
Looking at lots of logs, it seems like the time to clone the gaia repo is highly variable...on successful jobs, I see it range from 6 to 19 minutes, so it's not hard to imagine it taking 20+ minutes on some occasions, which causes buildbot to time it out.

Aki, is there some way we could specify a larger timeout than 1200s, and would we want to? AFAICT, it's only possible to set this per platform, which would mean we'd affect all tests on b2g desktop builds.

Ideally, though, we'd be able to improve the time it takes to clone or otherwise access the gaia repo from the slaves, so that it never even approaches 20 minutes, which seems pretty excessive. Aki or Hal may have some ideas here.
Flags: needinfo?(jgriffin)
Flags: needinfo?(hwine)
Flags: needinfo?(aki)
a) we can specify a 'timeout' per test suite, as well as a script_maxtime (timeout is an idle timeout; script_maxtime is max total runtime):
http://hg.mozilla.org/build/buildbotcustom/file/bdcd73d0ea88/misc.py#l609
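
To illustrate the difference between the two limits, here is a minimal standalone sketch in plain Python. It is not the buildbot or buildbotcustom implementation; `run_with_timeouts`, `idle_timeout` and `max_runtime` are hypothetical names chosen for this example. The idle limit resets every time the command prints a line, while the total limit never resets.

```python
import subprocess
import threading
import time


def run_with_timeouts(cmd, idle_timeout, max_runtime):
    """Run cmd and kill it if it is silent for idle_timeout seconds
    or if it runs for more than max_runtime seconds in total.

    Returns (status, lines); status is one of
    "ok", "failed", "idle_timeout" or "max_runtime".
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = []
    last_output = [time.monotonic()]  # list so the reader thread can update it

    def reader():
        # Every line of output resets the idle clock.
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))
            last_output[0] = time.monotonic()

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    start = time.monotonic()
    status = None
    while proc.poll() is None:
        now = time.monotonic()
        if now - start > max_runtime:
            status = "max_runtime"      # total budget exhausted
        elif now - last_output[0] > idle_timeout:
            status = "idle_timeout"     # no output for too long
        if status is not None:
            proc.kill()
            break
        time.sleep(0.05)

    proc.wait()
    thread.join(timeout=1)
    if status is None:
        status = "ok" if proc.returncode == 0 else "failed"
    return status, lines
```

With limits like these, a clone that keeps printing progress is only ever stopped by the total limit, whereas a hung clone that prints nothing is stopped by the idle limit.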

b) in bug 920161, I solved the talos clone timeout by:

* creating a virtualenv before the clone, with mozprocess
* cloning with an output_timeout set in the repo definition (or, alternately, a global vcs_output_timeout, as long as the virtualenv is created before any mercurial call):
http://hg.mozilla.org/build/mozharness/file/ba84049f96a0/mozharness/base/vcs/mercurial.py#l200
* making sure that output timeout is shorter than the buildbot timeout, with a retry
* re-running create_virtualenv() afterwards, since we install talos from the clone.  This only installs new packages, and ignores already installed modules.
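
The timeout-and-retry part of the steps above could look roughly like the sketch below. This is not the mozharness code: `clone_with_retry` and its parameters are hypothetical, and `subprocess.run(timeout=...)` is a wall-clock limit used here as a stand-in for the mozprocess output timeout. The 1200s value is the buildbot timeout mentioned earlier in this bug. The point is only that the inner limit must fire before the outer one, so the script gets a chance to retry instead of being killed.

```python
import subprocess

BUILDBOT_TIMEOUT = 1200  # seconds; the outer limit discussed in this bug


def clone_with_retry(cmd, output_timeout, attempts=2,
                     outer_timeout=BUILDBOT_TIMEOUT):
    """Run cmd up to `attempts` times, each with its own timeout.

    Returns the number of the attempt that succeeded.
    Raises ValueError if the inner timeout is not shorter than the
    outer one, and RuntimeError if every attempt fails or times out.
    """
    if output_timeout >= outer_timeout:
        raise ValueError(
            "inner timeout (%ss) must be shorter than the outer timeout (%ss)"
            % (output_timeout, outer_timeout)
        )
    for attempt in range(1, attempts + 1):
        try:
            # Assumption: a simple wall-clock timeout per attempt.
            subprocess.run(cmd, check=True, timeout=output_timeout)
            return attempt
        except subprocess.TimeoutExpired:
            # Printing here also produces output for the outer watcher.
            print("attempt %d timed out after %ss" % (attempt, output_timeout))
        except subprocess.CalledProcessError as exc:
            print("attempt %d failed with exit code %d" % (attempt, exc.returncode))
    raise RuntimeError("command failed after %d attempts" % attempts)
```

A real version would also need to remove any partial clone before retrying, which this sketch leaves out.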

This multi-step virtualenv creation is a headache, but that's the current solution, as long as mozprocess is required to have a timeout for the vcs steps.  This is largely why I've pushed back on having mozharness depend on any outside packages.  If it continues this way, however, a 2-step virtualenv may become SOP.
Flags: needinfo?(aki)
I'll make a change to increase the timeout on these tests for now.

The change to Talos is a bit different than what we want here (or maybe we want that in addition to something else); in that case, Talos is a small repo and if it takes 20 minutes, it's very likely hung.  But Gaia is a large repo, and seems to sometimes take 20 minutes without being hung; we'd like to give it a longer chance to complete before we kill it.

I vaguely remember some discussion in the past about hosting hg clones on network shares in order to avoid long clone times...was I imagining that, or is there some way we could do that instead of needing to clone Gaia on the slaves?
(In reply to Jonathan Griffin (:jgriffin) from comment #3)
> I vaguely remember some discussion in the past about hosting hg clones on
> network shares in order to avoid long clone times...was I imagining that, or
> is there some way we could do that instead of needing to clone Gaia on the
> slaves?

I think if the vcs_share_base is set in the config, or HG_SHARE_BASE_DIR is set in the env, it will use that directory for the |hg share| command.  I don't think we have an hg share directory set for test slaves, though.  We may want to look into creating one.
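
As a rough sketch of that lookup, and not the actual mozharness logic: the helper names, the precedence of config over environment, and the layout of the share directory are all assumptions made for illustration, and the repo URL in the usage note is a placeholder. The functions only build the `hg` command lines and do not run them.

```python
import os


def resolve_share_base(config, env=None):
    """Return the share directory from config or environment, or None.

    Assumption: the config value wins over the environment variable.
    """
    env = os.environ if env is None else env
    return config.get("vcs_share_base") or env.get("HG_SHARE_BASE_DIR")


def plan_checkout(repo_url, dest, config, env=None):
    """Return the list of hg commands needed to populate dest."""
    share_base = resolve_share_base(config, env)
    if not share_base:
        # No share directory configured: fall back to a full clone.
        return [["hg", "clone", repo_url, dest]]

    # Assumption: one shared repo per URL basename under the share base.
    repo_name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    shared = os.path.join(share_base, repo_name)

    commands = []
    if os.path.isdir(shared):
        # Shared repo already exists: only fetch new changesets.
        commands.append(["hg", "pull", "-R", shared, repo_url])
    else:
        # First use on this machine: clone history once, without a working copy.
        commands.append(["hg", "clone", "--noupdate", repo_url, shared])
    # The working directory then reuses the shared store instead of re-cloning.
    commands.append(["hg", "share", shared, dest])
    return commands
```

For example, `plan_checkout("https://hg.example.org/gaia-central", "gaia", {"vcs_share_base": "/builds/hg-shared"})` would plan one clone into the share directory followed by an `hg share`, and later runs on the same machine would plan a pull instead of the clone.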
Flags: needinfo?(hwine)
This will give the clone an additional 10 minutes; I should also write a patch similar to what you did for Talos to yield better error handling, which I'll do separately.
Attachment #8361886 - Flags: review?(aki)
Assignee: nobody → jgriffin
Attachment #8361886 - Flags: review?(aki) → review+
in production
Still seeing some of these... should we go for the talos headache? Could we just reuse that code?
Flags: needinfo?(jgriffin)
Yes, I'll implement the Talos hack here too.
Flags: needinfo?(jgriffin)
(In reply to Jonathan Griffin (:jgriffin) from comment #9)
> Yes, I'll implement the Talos hack here too.

Actually, in light of https://bugzilla.mozilla.org/show_bug.cgi?id=920153#c501, I think we should pursue setting up a gaia-central share so we don't need to clone the full repo on each test slave.  I'll file a separate bug for this.
Depends on: 964411
I haven't seen this since bug 964411 went into production, but I'll leave it open a bit longer in case.
I haven't seen any occurrences of this recently; will reopen or file a new bug if it resurfaces.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard