Closed Bug 951704 Opened 11 years ago Closed 10 years ago

10.6 talos jobs have been timing out since Dec. 9th

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: armenzg, Unassigned)

Details

On Dec. 9th [1], we started seeing a lot of talos timeouts [2], and the cause seems to be infra-related.

13:07:57    ERROR -  Traceback (most recent call last):
13:07:57     INFO -    File "/tools/python27/lib/python2.7/threading.py", line 551, in __bootstrap_inner
13:07:57     INFO -      self.run()
13:07:57     INFO -    File "/tools/python27/lib/python2.7/threading.py", line 504, in run
13:07:57     INFO -      self.__target(*self.__args, **self.__kwargs)
13:07:57     INFO -    File "/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 710, in _processOutput
13:07:57     INFO -      self.onTimeout()
13:07:57     INFO -    File "/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/talosProcess.py", line 67, in onTimeout
13:07:57 CRITICAL -      raise talosError("timeout")
13:07:57 CRITICAL -  talosError: timeout

There's a talos change from the 3rd [3] that is unlikely to be related, but we would like to confirm that through a try push [4].

This is happening for 10.6 talos svgr and tp5o jobs.

I see it happening quite often on m-i:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=10.6.*talos
but not on Aurora or Beta:
https://tbpl.mozilla.org/?tree=Mozilla-Aurora&jobname=10.6.*talos
https://tbpl.mozilla.org/?tree=Mozilla-Beta&jobname=10.6.*talos

I do see the tbpl push robot reporting a number of beta and aurora jobs. However, it's worth pointing out that the talos lines where the exceptions are raised do not exactly match what we see on trunk trees (different talos code?):
16:03:16    ERROR -  Traceback (most recent call last):
16:03:16     INFO -    File "/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/run_tests.py", line 277, in run_tests
16:03:16     INFO -      talos_results.add(mytest.runTest(browser_config, test))
16:03:16     INFO -    File "/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/ttest.py", line 407, in runTest
16:03:16 CRITICAL -      raise talosError("timeout exceeded")
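
To tell the two failure modes apart when going through many logs, something like the sketch below could help. It only looks for the two raise sites quoted in the tracebacks above (talosProcess.py onTimeout and ttest.py runTest). The function name, the labels and the idea of scanning raw log text are my own assumptions for illustration, not part of talos or tbpl.

```python
import re

# Raise sites copied from the two tracebacks quoted in this bug.
# The labels are hypothetical names chosen for this sketch.
SIGNATURES = {
    "talosProcess.onTimeout": re.compile(r'talosProcess\.py", line \d+, in onTimeout'),
    "ttest.runTest": re.compile(r'ttest\.py", line \d+, in runTest'),
}


def classify_timeout(log_text):
    """Return the set of labels whose raise site appears in log_text."""
    return {
        label
        for label, pattern in SIGNATURES.items()
        if pattern.search(log_text)
    }
```

Running classify_timeout() over the full text of each failed job's log would show whether a given tree hits the onTimeout path, the runTest path, or both.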

Sources of external changes that could affect a talos job outside of in-code landings:
* http://hg.mozilla.org/build/puppet
* http://hg.mozilla.org/build/talos
* http://hg.mozilla.org/build/buildbot-configs/graph
* http://hg.mozilla.org/build/buildbotcustom/graph
* http://hg.mozilla.org/build/mozharness/log/default/mozharness/mozilla/testing/talos.py
* http://hg.mozilla.org/build/mozharness/log/d163222e0366/configs/talos/mac_config.py
* Deployed Python packages (which should show up in the mozharness code)

There were two reconfigs on that day:
http://hg.mozilla.org/build/buildbot-configs/rev/d6e1e8576ad5
http://hg.mozilla.org/build/buildbot-configs/rev/e2705dee682b
This landed in mozharness that day:
http://hg.mozilla.org/build/mozharness/rev/1913406e2d96
The puppet changes did not seem relevant.
Nothing on buildbotcustom seems relevant.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=798219#c3067
[2] https://tbpl.mozilla.org/php/getParsedLog.php?id=32108592&tree=Mozilla-Inbound&full=1#error1
[3] http://hg.mozilla.org/build/talos/rev/2bcf422011d1
[4] https://tbpl.mozilla.org/?tree=Try&rev=e06e059534f9
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard