10.6 talos jobs have been timing out since Dec. 9th

RESOLVED INCOMPLETE

Status

Release Engineering
Platform Support
4 years ago

People

(Reporter: armenzg, Unassigned)

Details

Description

4 years ago
On Dec. 9th [1], we started seeing a lot of talos timeouts [2], and the cause appears to be infrastructure-related.

13:07:57    ERROR -  Traceback (most recent call last):
13:07:57     INFO -    File "/tools/python27/lib/python2.7/threading.py", line 551, in __bootstrap_inner
13:07:57     INFO -      self.run()
13:07:57     INFO -    File "/tools/python27/lib/python2.7/threading.py", line 504, in run
13:07:57     INFO -      self.__target(*self.__args, **self.__kwargs)
13:07:57     INFO -    File "/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 710, in _processOutput
13:07:57     INFO -      self.onTimeout()
13:07:57     INFO -    File "/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/talosProcess.py", line 67, in onTimeout
13:07:57 CRITICAL -      raise talosError("timeout")
13:07:57 CRITICAL -  talosError: timeout

A talos change landed on the 3rd [3]. It is unlikely to be related, but we would like to confirm that with a try push [4].

This is happening for 10.6 talos svgr and tp5o jobs.

I see it happening quite often on mozilla-inbound (m-i):
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&jobname=10.6.*talos
but not on Aurora or Beta:
https://tbpl.mozilla.org/?tree=Mozilla-Aurora&jobname=10.6.*talos
https://tbpl.mozilla.org/?tree=Mozilla-Beta&jobname=10.6.*talos

I do see the tbpl push robot reporting a number of Beta and Aurora jobs. However, the talos lines where the exceptions are raised do not exactly match what we see on trunk trees (different talos code?):
16:03:16    ERROR -  Traceback (most recent call last):
16:03:16     INFO -    File "/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/run_tests.py", line 277, in run_tests
16:03:16     INFO -      talos_results.add(mytest.runTest(browser_config, test))
16:03:16     INFO -    File "/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/ttest.py", line 407, in runTest
16:03:16 CRITICAL -      raise talosError("timeout exceeded")
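To tell the two failure modes apart when scanning many logs, something like the following could be used. This is only a minimal sketch: the function name `classify_timeouts` and the two labels are hypothetical, and the log line shape is assumed from the excerpts quoted in this bug, not from any talos or mozharness API.

```python
import re
from collections import Counter

# Log line shape assumed from the excerpts in this bug:
# "HH:MM:SS  LEVEL -  message"
LOG_LINE = re.compile(r"^\d{2}:\d{2}:\d{2}\s+(?P<level>[A-Z]+)\s+-\s+(?P<msg>.*)$")

# Hypothetical labels, named after the file/function that raises in each traceback.
SIGNATURES = {
    "talosProcess.onTimeout": re.compile(r'talosError\("timeout"\)'),
    "ttest.runTest": re.compile(r'talosError\("timeout exceeded"\)'),
}


def classify_timeouts(log_text):
    """Count how often each timeout signature appears in a parsed log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that are not in the expected log format
        for label, pattern in SIGNATURES.items():
            if pattern.search(match.group("msg")):
                counts[label] += 1
    return counts
```

Running this over a trunk log and an Aurora/Beta log would show which signature dominates on each tree, which is one way to check whether the branches are hitting the same timeout path.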

Sources of external changes, other than in-tree code landings, that could affect a talos job:
* http://hg.mozilla.org/build/puppet
* http://hg.mozilla.org/build/talos
* http://hg.mozilla.org/build/buildbot-configs/graph
* http://hg.mozilla.org/build/buildbotcustom/graph
* http://hg.mozilla.org/build/mozharness/log/default/mozharness/mozilla/testing/talos.py
* http://hg.mozilla.org/build/mozharness/log/d163222e0366/configs/talos/mac_config.py
* Deployed Python packages (which should show up in the mozharness code)

There were two reconfigs on that day:
http://hg.mozilla.org/build/buildbot-configs/rev/d6e1e8576ad5
http://hg.mozilla.org/build/buildbot-configs/rev/e2705dee682b
This change landed in mozharness that day:
http://hg.mozilla.org/build/mozharness/rev/1913406e2d96
The puppet changes do not seem relevant.
Nothing in buildbotcustom seems relevant either.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=798219#c3067
[2] https://tbpl.mozilla.org/php/getParsedLog.php?id=32108592&tree=Mozilla-Inbound&full=1#error1
[3] http://hg.mozilla.org/build/talos/rev/2bcf422011d1
[4] https://tbpl.mozilla.org/?tree=Try&rev=e06e059534f9

Updated

4 years ago
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → INCOMPLETE