Intermittent talos.PerfConfigurator.ConfigurationError: No definition found for test(s): ['glterrain']

RESOLVED WORKSFORME

Status

Testing
Talos
RESOLVED WORKSFORME
3 years ago
2 years ago

People

(Reporter: KWierso, Unassigned)

Tracking

({intermittent-failure})

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

3 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=43633562&full=1&branch=mozilla-inbound#error0

Ubuntu HW 12.04 x64 mozilla-inbound talos g1 on 2014-07-11 12:52:18 PDT for push 62f11352d198

slave: talos-linux64-ix-003



12:52:59     INFO - #####
12:52:59     INFO - ##### Running run-tests step.
12:52:59     INFO - #####
12:52:59     INFO - Running pre-action listener: _resource_record_pre_action
12:52:59     INFO - Running main action method: run_tests
12:52:59     INFO - Running command: ['/home/cltbld/talos-slave/test/build/venv/bin/python', '--version']
12:52:59     INFO - Copy/paste: /home/cltbld/talos-slave/test/build/venv/bin/python --version
12:52:59     INFO -  Python 2.7.3
12:52:59     INFO - Return code: 0
12:52:59     INFO - mkdir: /builds/slave/talos-slave/test/build/blobber_upload_dir
12:52:59     INFO - ENV: MOZ_UPLOAD_DIR is now /builds/slave/talos-slave/test/build/blobber_upload_dir
12:52:59     INFO - ENV: MINIDUMP_SAVE_PATH is now /builds/slave/talos-slave/test/build/blobber_upload_dir
12:52:59     INFO - Running command: ['/home/cltbld/talos-slave/test/build/venv/bin/talos', '--noisy', '--debug', '-v', '--executablePath', '/builds/slave/talos-slave/test/build/application/firefox/firefox', '--title', 'talos-linux64-ix-003', '--symbolsPath', 'https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1405105932/firefox-33.0a1.en-US.linux-x86_64.crashreporter-symbols.zip', '--activeTests', 'tp5o_scroll:glterrain', '--results_url', 'http://graphs.mozilla.org/server/collect.cgi', '--output', 'talos.yml', '--branchName', 'Mozilla-Inbound-Non-PGO', '--datazilla-url', 'https://datazilla.mozilla.org/talos', '--authfile', '/builds/slave/talos-slave/test/oauth.txt', '--webServer', 'localhost'] in /builds/slave/talos-slave/test/build
12:52:59     INFO - Copy/paste: /home/cltbld/talos-slave/test/build/venv/bin/talos --noisy --debug -v --executablePath /builds/slave/talos-slave/test/build/application/firefox/firefox --title talos-linux64-ix-003 --symbolsPath https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1405105932/firefox-33.0a1.en-US.linux-x86_64.crashreporter-symbols.zip --activeTests tp5o_scroll:glterrain --results_url http://graphs.mozilla.org/server/collect.cgi --output talos.yml --branchName Mozilla-Inbound-Non-PGO --datazilla-url https://datazilla.mozilla.org/talos --authfile /builds/slave/talos-slave/test/oauth.txt --webServer localhost
12:52:59     INFO - Using env: {'COLORTERM': 'gnome-terminal',
12:52:59     INFO -  'COMPIZ_CONFIG_PROFILE': 'ubuntu',
12:52:59     INFO -  'DBUS_SESSION_BUS_ADDRESS': 'unix:abstract=/tmp/dbus-Uw0aW9hid3,guid=5bd8f63d48d2a42783e33e1b0000003b',
12:52:59     INFO -  'DEFAULTS_PATH': '/usr/share/gconf/ubuntu.default.path',
12:52:59     INFO -  'DISPLAY': ':0',
12:52:59     INFO -  'GNOME_DESKTOP_SESSION_ID': 'this-is-deprecated',
12:52:59     INFO -  'GNOME_KEYRING_CONTROL': '/tmp/keyring-tfT1Qf',
12:52:59     INFO -  'GPG_AGENT_INFO': '/tmp/keyring-tfT1Qf/gpg:0:1',
12:52:59     INFO -  'HOME': '/home/cltbld',
12:52:59     INFO -  'LANG': 'en_US.UTF-8',
12:52:59     INFO -  'LANGUAGE': 'en_US:en',
12:52:59     INFO -  'LOGNAME': 'cltbld',
12:52:59     INFO -  'MAIL': '/var/mail/cltbld',
12:52:59     INFO -  'MANDATORY_PATH': '/usr/share/gconf/ubuntu.mandatory.path',
12:52:59     INFO -  'MINIDUMP_SAVE_PATH': '/builds/slave/talos-slave/test/build/blobber_upload_dir',
12:52:59     INFO -  'MOZ_CRASHREPORTER_NO_REPORT': '1',
12:52:59     INFO -  'MOZ_NO_REMOTE': '1',
12:52:59     INFO -  'MOZ_UPLOAD_DIR': '/builds/slave/talos-slave/test/build/blobber_upload_dir',
12:52:59     INFO -  'NODE_PATH': '/usr/lib/nodejs:/usr/lib/node_modules:/usr/share/javascript',
12:52:59     INFO -  'NO_EM_RESTART': '1',
12:52:59     INFO -  'PATH': '/home/cltbld/talos-slave/test/build/venv/bin:/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games',
12:52:59     INFO -  'PROPERTIES_FILE': '/builds/slave/talos-slave/test/buildprops.json',
12:52:59     INFO -  'PWD': '/builds/slave/talos-slave/test',
12:52:59     INFO -  'SESSION_MANAGER': 'local/talos-linux64-ix-003:@/tmp/.ICE-unix/2281,unix/talos-linux64-ix-003:/tmp/.ICE-unix/2281',
12:52:59     INFO -  'SHELL': '/bin/bash',
12:52:59     INFO -  'SHLVL': '1',
12:52:59     INFO -  'SSH_AGENT_PID': '2272',
12:52:59     INFO -  'SSH_AUTH_SOCK': '/tmp/keyring-tfT1Qf/ssh',
12:52:59     INFO -  'TERM': 'xterm',
12:52:59     INFO -  'TMOUT': '86400',
12:52:59     INFO -  'UBUNTU_MENUPROXY': 'libappmenu.so',
12:52:59     INFO -  'USER': 'cltbld',
12:52:59     INFO -  'WINDOWID': '18874406',
12:52:59     INFO -  'XDG_CONFIG_DIRS': '/etc/xdg/xdg-ubuntu:/etc/xdg',
12:52:59     INFO -  'XDG_CURRENT_DESKTOP': 'Unity',
12:52:59     INFO -  'XDG_DATA_DIRS': '/usr/share/ubuntu:/usr/share/gnome:/usr/local/share/:/usr/share/',
12:52:59     INFO -  'XDG_SESSION_COOKIE': '0badd6d79e82f792eed2a2c4000001d4-1405108320.64721-1627291359',
12:52:59     INFO -  'XPCOM_DEBUG_BREAK': 'warn',
12:52:59     INFO -  '_': '/etc/X11/Xsession',
12:52:59     INFO -  '__GL_YIELD': 'NOTHING'}
12:52:59     INFO - Calling ['/home/cltbld/talos-slave/test/build/venv/bin/talos', '--noisy', '--debug', '-v', '--executablePath', '/builds/slave/talos-slave/test/build/application/firefox/firefox', '--title', 'talos-linux64-ix-003', '--symbolsPath', 'https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1405105932/firefox-33.0a1.en-US.linux-x86_64.crashreporter-symbols.zip', '--activeTests', 'tp5o_scroll:glterrain', '--results_url', 'http://graphs.mozilla.org/server/collect.cgi', '--output', 'talos.yml', '--branchName', 'Mozilla-Inbound-Non-PGO', '--datazilla-url', 'https://datazilla.mozilla.org/talos', '--authfile', '/builds/slave/talos-slave/test/oauth.txt', '--webServer', 'localhost'] with output_timeout 3600
12:52:59    ERROR -  Traceback (most recent call last):
12:52:59     INFO -    File "/home/cltbld/talos-slave/test/build/venv/bin/talos", line 9, in <module>
12:52:59     INFO -      load_entry_point('talos==0.0', 'console_scripts', 'talos')()
12:52:59     INFO -    File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/run_tests.py", line 346, in main
12:52:59     INFO -      options, args = parser.parse_args(args)
12:52:59     INFO -    File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/PerfConfigurator.py", line 590, in parse_args
12:52:59     INFO -      options, args = Configuration.parse_args(self, *args, **kwargs)
12:52:59     INFO -    File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/configuration.py", line 476, in parse_args
12:52:59     INFO -      self(*config)
12:52:59     INFO -    File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/PerfConfigurator.py", line 401, in __call__
12:52:59     INFO -      return Configuration.__call__(self, *args)
12:52:59     INFO -    File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/configuration.py", line 380, in __call__
12:52:59     INFO -      self.validate()
12:52:59     INFO -    File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/PerfConfigurator.py", line 561, in validate
12:52:59     INFO -      self.config.setdefault('tests', []).extend(self.tests(activeTests, overrides, global_overrides, counters))
12:52:59     INFO -    File "/home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/talos/PerfConfigurator.py", line 670, in tests
12:52:59     INFO -      raise ConfigurationError("No definition found for test(s): %s" % missing)
12:52:59     INFO -  talos.PerfConfigurator.ConfigurationError: No definition found for test(s): ['glterrain']
12:52:59    ERROR - Return code: 1
12:52:59    ERROR - # TBPL WARNING #
12:52:59     INFO - Running post-action listener: _resource_record_post_action
12:52:59     INFO - Running post-run listener: _resource_record_post_run
(Reporter)

Comment 1

3 years ago
This happened twice earlier; both instances were starred as infra:
https://tbpl.mozilla.org/php/getParsedLog.php?id=43621638&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=43623233&tree=Mozilla-Inbound
This is really odd. From the logs it appears we are pulling the correct talos.json, which references the correct revision. Knowing that, I can think of a few things that could be wrong:
1) we have a cached version of talos on the box and something is in conflict
2) hg.mozilla.org/build/talos has issues or some kind of corruption
3) we fail to update talos to the revision and never print an error or warning

Comment 3

3 years ago
So, this intermittent failure will almost certainly disappear soon by itself - because the caches will get updated, etc.

However, if there's a real issue, and it looks to me like there is, then it will not go away by itself. It could bite us the next time we push a change, and we might never even notice it if the change doesn't result in a failure (e.g. if we only refine an existing test instead of adding a new one: a stale talos fails on a newly added test, but silently runs the old version of a refined one).

If we're serious about really fixing it, I suggest creating a new dummy test and, each time we update talos.json, using a new name for it (dummy2/3/etc).

This way the failure would manifest whenever we hit this bug, and it would also help us notice when it's gone (after we fix the bug, we keep changing the dummy names and the failures no longer happen).
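A minimal sketch of the canary idea, for illustration only. The helper `missing_tests` and the names `dummy1`/`dummy2` are hypothetical and not part of talos; the function only mirrors the "No definition found for test(s)" check visible in the traceback above.

```python
def missing_tests(active_tests, defined_tests):
    """Return names from a colon-separated --activeTests string that have no definition."""
    requested = [name for name in active_tests.split(":") if name]
    return [name for name in requested if name not in defined_tests]


# Hypothetical stale slave: talos.json already asks for the new canary
# (dummy2), but the installed talos only knows the previous one (dummy1).
stale_definitions = {"tp5o_scroll", "glterrain", "dummy1"}
fresh_definitions = {"tp5o_scroll", "glterrain", "dummy2"}

requested = "tp5o_scroll:glterrain:dummy2"
print(missing_tests(requested, stale_definitions))  # ['dummy2']
print(missing_tests(requested, fresh_definitions))  # []
```

Because the canary name changes with every talos.json update, a slave running any older talos would fail loudly instead of quietly running outdated code.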
Comment hidden (Treeherder Robot)
Joel, considering that this doesn't seem to go away, is there any chance it's more than "not fully updated talos, maybe due to cache"?

Maybe we missed something with the patch for this test, either in the test itself (probably at test.py) or while defining the new group 'g1'?

Would anyone else be able to help/review the patches and maybe offer more hypotheses?
Flags: needinfo?(jmaher)
Also, can we tell at what percentages do we encounter this issue? i.e. we know we've seen the issue 9 times already (comments 4 - 12), how many times did the test succeed without encountering this issue?

Can we correlate this issue to specific platforms, talos bots, or anything else?
so some of these are on linux32, but the large majority is linux64.

I really don't understand how this could randomly fail with this error, but if you look at the slaves this error shows up on:
http://brasstacks.mozilla.com/orangefactor/index.html?display=Bug&tree=trunk&startday=2014-07-08&endday=2014-07-15&bugid=1037619

I see:
talos-linux64-ix-092
talos-linux64-ix-008
talos-linux64-ix-118
talos-linux64-ix-003

and from the above comments:
talos-linux32-ix-001

I suspect it is bad slaves: we keep seeing the same slave names repeat.
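One way to test the "same slaves keep repeating" suspicion would be to tally job outcomes per slave. This is only a sketch: the record format, the helper names, and the sample data (with placeholder slave names) are all made up for illustration and do not come from OrangeFactor or slave health.

```python
from collections import Counter


def failures_per_slave(records):
    """Count failed jobs per slave from (slave_name, passed) pairs."""
    return Counter(slave for slave, passed in records if not passed)


def never_passed(records):
    """Return the sorted names of slaves that have no passing job at all."""
    any_pass = {}
    for slave, passed in records:
        any_pass[slave] = any_pass.get(slave, False) or passed
    return sorted(slave for slave, ok in any_pass.items() if not ok)


# Invented sample data; real input would be the talos g1 job history.
sample = [
    ("slave-a", False), ("slave-a", False),
    ("slave-b", True), ("slave-b", False),
    ("slave-c", True),
]
print(failures_per_slave(sample).most_common())
print(never_passed(sample))
```

A slave that shows up in `never_passed` is a stronger candidate for a machine-specific problem than one that fails only occasionally.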


:pmoore, what can we do with suspected bad slaves?  Is there a reimaging process?  maybe more diagnostics?  I would like to query these 5 slaves and see if they have had success running talos-g1
Flags: needinfo?(jmaher) → needinfo?(pmoore)
Hey Joel,

I also agree it would make sense to find the root cause of the problems, rather than only reimaging the slaves. One option is to request these specific slaves as loan requests.

I've seen emails about some delays between web heads syncing on hg.mozilla.org - it is a long shot, but might be worth seeing if maybe there was some lag there that caused the problem (could check with hwine or fubar what symptoms would be).

Otherwise, if you'd like some releng assistance, maybe we could have a deep dive together to have a look - let me know if you fancy that.

Pete
Flags: needinfo?(pmoore)
If I understand correctly, this issue is between talos.json requesting to execute the glterrain test (which indicates talos.json is recent because this test was added recently) and talos not containing or identifying this test (I'm guessing at least test.py is outdated).

AFAIK talos is updated according to talos.zip or mozharness (not sure which applies here).

Can we request the slaves to log the versions of the components in use (specifically right now we'll need talos.json "version" and the talos changeset in use).

Maybe it could help us understand what's causing this?
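A rough sketch of what such logging could look like. It assumes talos.json keeps the pinned changeset under `global` → `talos_revision`, and that the changeset actually checked out is available as a string; the helper names, the example URL, and the revision values below are placeholders, not real data.

```python
import json


def talos_revision(talos_json_text):
    """Return the talos revision named in a talos.json document, or None."""
    config = json.loads(talos_json_text)
    return config.get("global", {}).get("talos_revision")


def version_log_line(talos_json_text, checked_out_revision):
    """Build one log line comparing the expected and the actual talos revision."""
    expected = talos_revision(talos_json_text)
    status = "OK" if expected == checked_out_revision else "MISMATCH"
    return "talos revision: expected=%s actual=%s [%s]" % (
        expected, checked_out_revision, status)


# Placeholder document and revisions, for illustration only.
example = '{"global": {"talos_repo": "https://example.invalid/talos", "talos_revision": "abc123"}}'
print(version_log_line(example, "abc123"))
print(version_log_line(example, "old000"))
```

Note that this compares against the checkout, not against what is installed in the venv; if the installed module can lag behind the checkout, the installed copy would need its own check.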
Actually for desktop we clone talos and update it to the revision in talos.json:
http://dxr.mozilla.org/mozilla-central/source/testing/talos/talos.json#8

here is the mozharness script:
http://hg.mozilla.org/build/mozharness/file/02c564f50818/mozharness/mozilla/testing/talos.py#l471

I get lost in mozharness land, but that is something we should look into.  I ran into a problem in April where we had a version mismatch of libraries that mozharness (actually mozinstall) depended on which was cached locally, so we would get a version mismatch.  The solution was to make talos depend on mozinstall (which is ridiculous because talos doesn't depend on mozinstall, mozharness depends on it).

what is scary is that we make modifications to our talos code and then get inconsistent results when run on slaves.  This missing test is the second instance of that and it makes it difficult to file regression bugs when there is randomness inserted into the puzzle.
So if you request these specific slaves in a loan request, the only changes we make to them before handing them out is removing key security files, changing passwords (system users and vnc) and then hand it out - so then you'd get the slaves in the current state they are in, and you can investigate them. When you hand them back, they get reimaged, so this would kill two birds with one stone: you could investigate the problems, and when you give them back, you know they'll get cleaned up.

If you like this idea, just raise a bug (I see no reason to raise multiple - one should do, that lists the slaves) and that will get picked up by Simone who is on build duty this week.

If you fancy deep diving the issue together (a mozharness safari, so to speak), I'm happy to join.

Thanks,
Pete
(In reply to Pete Moore [:pete][:pmoore] from comment #20)
> So if you request these specific slaves in a loan request, the only changes
> we make to them before handing them out is removing key security files,
> changing passwords (system users and vnc) and then hand it out - so then
> you'd get the slaves in the current state they are in, and you can
> investigate them. When you hand them back, they get reimaged, so this would
> kill two birds with one stone: you could investigate the problems, and when
> you give them back, you know they'll get cleaned up.
> 
> If you like this idea, just raise a bug (I see no reason to raise multiple -
> one should do, that lists the slaves) and that will get picked up by Simone
> who is on build duty this week.
> 
> If you fancy deep diving the issue together (a mozharness safari, so to
> speak), I'm happy to join.
> 
> Thanks,
> Pete

Hey Joel,

How would you like to proceed?

Thanks,
Pete
Flags: needinfo?(jmaher)
Pete, can I get a loaner of talos-linux64-ix-118 ?  I think we need to investigate why these specific machines are problematic.
Flags: needinfo?(jmaher)
Even after this issue is fixed (right now it seems to be tied specifically to 3 machines), how can we make sure it doesn't happen again? This time a partially updated bot happens to fail outright, but with other updates a stale bot might not fail at all.

Maybe add some directory checksums/hashes/something to the log before the talos run starts? So if something is not fully updated we could notice it at the logs, or even raise some flag automatically?
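As a sketch of the checksum idea (not existing mozharness or talos code): hash every file under a directory in a fixed order and print the digest before the run. `tree_digest` is a hypothetical helper, and the temporary directory below just stands in for the installed talos package.

```python
import hashlib
import os
import tempfile


def tree_digest(root):
    """Return a SHA-256 hex digest over the relative paths and contents of all files under root."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            digest.update(b"\0")
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
            digest.update(b"\0")
    return digest.hexdigest()


# Demonstration on a throwaway directory standing in for site-packages/talos.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "test.py")
    with open(target, "w") as handle:
        handle.write("class tp5o_scroll: pass\n")
    before = tree_digest(root)
    with open(target, "a") as handle:
        handle.write("class glterrain: pass\n")
    after = tree_digest(root)

print(before != after)  # True: any changed file changes the digest
```

Comparing the logged digest across slaves running the same talos revision would expose a stale copy even when the run itself does not fail. Generated files such as `.pyc` are not filtered here and would need to be excluded for the digests to be comparable.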
can I get a loaner of talos-linux-ix-118?  Also do you have ideas on how to address Avi's concerns in comment 42?
Flags: needinfo?(pmoore)
Depends on: 1042738
(In reply to Joel Maher (:jmaher) from comment #43)
> can I get a loaner of talos-linux-ix-118?

I've created loan request in bug 1042738.

> Also do you have ideas on how to address Avi's concerns in comment 42?

I'd say this issue is complex enough it warrants having a vidyo chat about it - let's say Avi, you, me, and anyone else who is interested? I think a deep dive will be best, at least for me - so I can get up-to-speed on the issue, and throw all my questions to you at once. :)

Maybe Armen might also be interested in getting involved?
Flags: needinfo?(pmoore)
Armen, would this also be interesting for you? (comment 44)
Flags: needinfo?(armenzg)
how about we do this after we determine the root cause for this specific issue :)

Comment 47

3 years ago
I'm not interested in getting myself deep into it.

However, I have observed some things that might be useful:
* We clobber this: 
23:37:16     INFO - retry: Calling <function rmtree at 0x9f9a764> with args: ('/builds/slave/talos-slave/test/build',), kwargs: {}, attempt #1
* However, the path to the talos venv is this:
23:37:16     INFO -  'virtualenv_path': '/home/cltbld/talos-slave/test/build/venv',
* This can be seen here:
23:37:34     INFO - Virtualenv /home/cltbld/talos-slave/test/build/venv/bin/python appears to already exist; skipping virtualenv creation.

It seems that we copy the manifest to the right location:
23:37:38     INFO - Copying /builds/slave/talos-slave/talos-data/talos/page_load_test/tp5n/tp5o.manifest to /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/page_load_test/tp5n/tp5o.manifest

I don't think clobbering the wrong thing should be the cause but there might be some connection I can't picture right now.
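To make the path observation concrete, here is a small lexical check of whether the venv lies under the clobbered directory. `is_inside` is a hypothetical helper; it does not resolve symlinks, so on a real slave (where one of these paths may be a symlink to the other) `os.path.realpath` would have to be applied first.

```python
import posixpath


def is_inside(path, directory):
    """True if path is the directory itself or lies beneath it (lexical check only)."""
    path = posixpath.normpath(path)
    directory = posixpath.normpath(directory)
    return path == directory or path.startswith(directory + "/")


# Paths taken from the log lines quoted above.
clobbered = "/builds/slave/talos-slave/test/build"
venv = "/home/cltbld/talos-slave/test/build/venv"

print(is_inside(venv, clobbered))  # False: as written, the venv is not under the clobbered path
```

If the check is still False after resolving symlinks, the venv (and whatever talos is installed in it) survives every clobber.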
Flags: needinfo?(armenzg)

Updated

3 years ago
Blocks: 1023496

Updated

3 years ago
No longer blocks: 1023496
Instead of going away, it seems to be "spreading" to new machines. So far it's been talos-linux64-ix-001/003/018, but yesterday (comment 66) we saw 004 for the first time, and now 092 at comment 70 ... :/

Something is off here...
Comment 105 is different from the rest.

Either it doesn't belong here, or these are two different issues that happened to manifest on one build, or we got new symptoms:

(In reply to TBPL Robot from comment #105)
> RyanVM
> https://tbpl.mozilla.org/php/getParsedLog.php?id=45348426&tree=B2g-Inbound
> Ubuntu HW 12.04 b2g-inbound pgo talos g1 on 2014-08-06 10:41:36
> revision: b5c072f24e08
> slave: talos-linux32-ix-001
> 
> abort: error: _ssl.c:504: EOF occurred in violation of protocol
> Automation Error: hg not responding
> Return code: 255
> abort: error: _ssl.c:504: EOF occurred in violation of protocol
> Automation Error: hg not responding
> Return code: 255
> abort: error: _ssl.c:504: EOF occurred in violation of protocol
> Automation Error: hg not responding
> Return code: 255
> Traceback (most recent call last):
> talos.PerfConfigurator.ConfigurationError: No definition found for test(s):
> ['glterrain']
> Return code: 1
> # TBPL WARNING #
(In reply to TBPL Robot from comment #114)
> RyanVM
> https://tbpl.mozilla.org/php/getParsedLog.php?id=45452047&tree=Fx-Team
> Ubuntu HW 12.04 x64 fx-team pgo talos g1 on 2014-08-07 11:45:58
> revision: 273d6c900fb1
> slave: talos-linux64-ix-003
> 
> abort: HTTP Error 500: Internal Server Error
> Automation Error: hg not responding
> Return code: 255
> abort: HTTP Error 500: Internal Server Error
> Automation Error: hg not responding
> Return code: 255
> Traceback (most recent call last):
> talos.PerfConfigurator.ConfigurationError: No definition found for test(s):
> ['glterrain']
> Return code: 1
> # TBPL WARNING #


Another different message with the same "undefined test" error.
the problem here seems to be that we are using a cached version of the talos module:
https://pastebin.mozilla.org/6084544

Related to this- is the solution here to add --upgrade all the time?  Why in the failure case do we have a special setup.py?

Callek, jlund- do you guys know what we can do here?  This is a recurring problem.
Flags: needinfo?(jlund)
Flags: needinfo?(bugspam.Callek)
(In reply to Joel Maher (:jmaher) from comment #184)
> the problem here seems to be that we are using a cached version of the talos
> module:
> https://pastebin.mozilla.org/6084544

Failing log:

> 04:42:38     INFO -  Unpacking ./talos_repo
> 04:42:40     INFO -    Running setup.py (path:/tmp/pip-mfvQHn-build/setup.py) egg_info for package from file:///builds/slave/talos-slave/test/build/talos_repo


working log:

08:12:49     INFO -  Unpacking ./talos_repo
08:12:49     INFO -    Running setup.py egg_info for package from file:///builds/slave/talos-slave/test/build/talos_repo
I was not able to look at this. I am on PTO now until wed. I may look at it before then but I can't promise that. :)
Blocks: 977306
Blocks: 877667
Blocks: 895422
Blocks: 937143
Blocks: 974448
So, we see this on talos-linux32-ix-001 which was reimaged on 2014-07-09, on talos-linux64-ix-003 which might have been reimaged on 2014-06-24 or maybe on 2014-07-09, hard to tell, on talos-linux64-ix-004 which was reimaged on 2014-06-24, on talos-linux64-ix-008 which was reimaged on 2014-07-09, and on talos-linux64-ix-092 which was reimaged on 2014-07-07.

Did this talos suite require that some change be made to the slaves, some change which the slaves reimaged during those couple of weeks managed to evade?
(In reply to Phil Ringnalda (:philor) from comment #213)
> So, we see this on talos-linux32-ix-001 which was reimaged on 2014-07-09, on
> talos-linux64-ix-003 which might have been reimaged on 2014-06-24 or maybe
> on 2014-07-09, hard to tell, on talos-linux64-ix-004 which was reimaged on
> 2014-06-24, on talos-linux64-ix-008 which was reimaged on 2014-07-09, and on
> talos-linux64-ix-092 which was reimaged on 2014-07-07.
> 
> Did this talos suite require that some change be made to the slaves, some
> change which the slaves reimaged during those couple of weeks managed to
> evade?

A new test (glterrain - which fails to have a definition on this bug) was added to talos on July 1st - bug 1020663 comment 40.

Don't the slaves get updated with the proper talos revision on every talos run?
Also, it just so happens that a stale talos results in an explicit failure for this specific update. But for most talos updates, being stale just means running outdated code without explicit failures. How can we tell that it doesn't happen elsewhere quietly?
joel, I see that talos-linux-ix-118 (bug 1042738) is still on loan to you for this. Has there been any luck or new knowledge from that?

So I see there being two requirements for this bug to be resolved:

1) find out why we occasionally have the wrong talos version
    a) since it fails from missing 'glterrain' (comment 214) and when we use the cache we fail (comment 185), it seems likely that the culprit is our caching logic or a subset of slaves being in a corrupt state WRT cache path.
    b) poke the known slaves this happens to: talos-linux64-ix-008, talos-linux32-ix-001, talos-linux64-ix-003, talos-linux64-ix-004, and talos-linux64-ix-092. I'm assuming these will have a pre-July-1st talos, as that's when glterrain was added. But we should also check whether they are passing too.

2) find out if we can verify we have the correct talos update (for the cache and non cache case)
    a) log an error if we have a mismatch so we can catch all those instances and not just where we would fail if talos was out of date.

if this bug is high up on intermittent failures, I can shift my priorities and make time for this.
Flags: needinfo?(jmaher)
Flags: needinfo?(jlund)
Flags: needinfo?(bugspam.Callek)
(In reply to Avi Halachmi (:avih) from comment #14)
> Also, can we tell at what percentages do we encounter this issue? i.e. we
> know we've seen the issue 9 times already (comments 4 - 12), how many times
> did the test succeed without encountering this issue?
> 
> Can we correlate this issue to specific platforms, talos bots, or anything
> else?

I am late to the party so this might already be known but here are some findings:

sample data:
    - talos-linux64-ix-008, talos-linux32-ix-001, talos-linux64-ix-003, talos-linux64-ix-004, talos-linux64-ix-092, and talos-linux64-ix-118

results (from eyeballing):
    - every single instance of this bug has been the result of the 'talos g1' variant
    - it is not limited by branch
    - none of them have passed a 'talos g1' job as far back as slave health goes.
hmm, I'm having a hard time spotting the difference between a 'healthy slave' and an 'unhealthy one'

I can see that /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/ on an unhealthy one is old and has not been updated since at least July 11th (http://hg.mozilla.org/mozilla-central/rev/11f7830bc276), as it has no 'class glterrain(PageloaderTest):' in test.py
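The same check can be scripted rather than eyeballed. This sketch parses Python source and lists its top-level class names; `defined_classes` is a hypothetical helper and the two source strings are stand-ins for a stale and an up-to-date test.py. It would need to run under the same Python version the file was written for, since `ast.parse` rejects syntax the interpreter does not understand.

```python
import ast


def defined_classes(source):
    """Return the names of the top-level classes defined in Python source text."""
    tree = ast.parse(source)
    return {node.name for node in tree.body if isinstance(node, ast.ClassDef)}


# Stand-ins for the contents of test.py on a stale and on a fresh slave.
stale = "class tp5o_scroll(PageloaderTest):\n    pass\n"
fresh = stale + "class glterrain(PageloaderTest):\n    pass\n"

print("glterrain" in defined_classes(stale))  # False
print("glterrain" in defined_classes(fresh))  # True
```

The base class does not need to be importable, because the source is only parsed and never executed.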

How can I see what revision /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages/talos is based on? Pardon my ignorance, but should we be versioning the talos module, since we are managing it with pip? It looks like it is always 0.0.


It looks like we currently 'pip install' talos with the exact same cmd whether it's on a healthy machine or not for g1 talos. Further, our 'cache' (/home/cltbld/talos-slave/test/build/venv/cache) does not have any references of talos so I am not sure where we are getting the following from the unhealthy case:

13:07:26     INFO -    Running setup.py (path:/tmp/pip-aXV9in-build/setup.py) egg_info for package from file:///builds/slave/talos-slave/test-pgo/build/talos_repo
13:07:27     INFO -    Requirement already satisfied (use --upgrade to upgrade): talos==0.0 from file:///builds/slave/talos-slave/test-pgo/build/talos_repo in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages
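Lines like the one above can be picked out of a job log mechanically. The following is only a sketch: the regular expression is written against the log format quoted in this bug, and the function name is hypothetical.

```python
import re

# Matches pip's "already satisfied" message as it appears in the logs quoted
# in this bug. The "from <source>" part is optional because dependency lines
# (e.g. PyYAML) do not carry it.
ALREADY_SATISFIED = re.compile(
    r"Requirement already satisfied \(use --upgrade to upgrade\): "
    r"(?P<requirement>\S+)(?: from (?P<source>\S+))? in (?P<location>\S+)"
)


def find_skipped_requirements(log_text):
    """Return one dict per requirement that pip reported as already satisfied."""
    return [m.groupdict() for m in ALREADY_SATISFIED.finditer(log_text)]
```

A hit whose requirement is `talos==0.0` and whose source is the talos_repo checkout would mark a job where the checkout was not installed over the existing copy.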


the only clue I can go by is maybe a difference in pip that changes the behavior when used with similar args


HEALTHY:
[cltbld@talos-linux64-ix-095 venv]$ pwd
/home/cltbld/talos-slave/test/build/venv
[cltbld@talos-linux64-ix-095 venv]$ source bin/activate && pip --version
pip 0.8.2 from /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages/pip-0.8.2-py2.7.egg (python 2.7)

UNHEALTHY:
[cltbld@talos-linux64-ix-008 venv]$ pwd
/home/cltbld/talos-slave/test/build/venv
[cltbld@talos-linux64-ix-008 venv]$ source bin/activate && pip --version
pip 1.5.5 from /home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/pip-1.5.5-py2.7.egg (python 2.7)

I wonder if we will find 1.5.5 for all broken slaves. did we change this recently somewhere and it is getting picked up with post re-images? if it is something like this, philor was right.
jlund,

on my loaner slave 118, I see this:
[cltbld@talos-linux64-ix-118 ~]$ cd /home/cltbld/talos-slave/test/build/venv
[cltbld@talos-linux64-ix-118 venv]$ source bin/activate && pip --version
pip 1.5.5 from /home/cltbld/talos-slave/test/build/venv/local/lib/python2.7/site-packages/pip-1.5.5-py2.7.egg (python 2.7)
(venv)[cltbld@talos-linux64-ix-118 venv]$ 


Maybe the newer pip version is the root cause of this.  We can version talos.  I just thought with us checking it out and running setup that there wouldn't be a need for that.

I want to confirm that the pip version and how it caches stuff is most likely the root cause here before embarking on talos versioning.
Flags: needinfo?(jmaher) → needinfo?(jlund)
another thought is if I bump the talos version each time (http://hg.mozilla.org/build/talos/file/49b74c08dad4/setup.py#l11), how do we reference that and ensure that we are requiring the latest version?

I assume we could define that in http://dxr.mozilla.org/mozilla-central/source/testing/talos/talos.json, but right now we just define revisions.

lets first confirm what the root of the problem is and then sort out solutions- I assume the pip version is the problem.
> lets first confirm what the root of the problem is and then sort out
> solutions- I assume the pip version is the problem.

we bumped pip from 0.8.2 to 1.5.5 on June 18th, 2014: http://hg.mozilla.org/build/puppet/rev/158dc4354ed3 so that was pretty recent, and something like the addition of glterrain (July 11th) was likely the first thing to bring to light the fact that talos is not updating on slaves with 1.5.5

I was not able to find a working linux with 1.5.5. IMO this is strong evidence to suggest pip version is the culprit.

not working:
talos-linux64-ix-004 -> 1.5.5
talos-linux64-ix-008 -> 1.5.5
talos-linux64-ix-118 -> 1.5.5
talos-linux64-ix-001 -> 1.5.5
talos-linux64-ix-003 -> 1.5.5
talos-linux64-ix-092 -> 1.5.5

working:
talos-linux32-ix-055 -> 0.8.2
talos-linux64-ix-119 -> 0.8.2
talos-linux32-ix-005 -> 0.8.2
talos-linux64-ix-017 -> 0.8.2
talos-linux32-ix-048 -> 0.8.2
talos-linux64-ix-086 -> 0.8.2

Interestingly, I did find mac osx pip versions working for g1 with 1.5.5:

t-snow-r4-0151 -> 1.5.5
t-snow-r4-0134 -> 1.5.5

I poked at a few windows and they seem to be at 0.8.2:
t-xp32-ix-062 -> 0.8.2
t-w732-ix-012 -> 0.8.2
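To make the pattern in these lists easier to see, here is a small sketch that groups the hosts above by platform and pip version. The data is copied from this comment; the g1 outcome for the Windows hosts is not stated above, so it is recorded as such. `platform_of` is a hypothetical helper that guesses the platform from the hostname prefix.

```python
from collections import defaultdict

# (host, pip version, g1 outcome) as listed in this comment.
OBSERVATIONS = [
    ("talos-linux64-ix-004", "1.5.5", "failing"),
    ("talos-linux64-ix-008", "1.5.5", "failing"),
    ("talos-linux64-ix-118", "1.5.5", "failing"),
    ("talos-linux64-ix-001", "1.5.5", "failing"),
    ("talos-linux64-ix-003", "1.5.5", "failing"),
    ("talos-linux64-ix-092", "1.5.5", "failing"),
    ("talos-linux32-ix-055", "0.8.2", "working"),
    ("talos-linux64-ix-119", "0.8.2", "working"),
    ("talos-linux32-ix-005", "0.8.2", "working"),
    ("talos-linux64-ix-017", "0.8.2", "working"),
    ("talos-linux32-ix-048", "0.8.2", "working"),
    ("talos-linux64-ix-086", "0.8.2", "working"),
    ("t-snow-r4-0151", "1.5.5", "working"),
    ("t-snow-r4-0134", "1.5.5", "working"),
    ("t-xp32-ix-062", "0.8.2", "not stated"),
    ("t-w732-ix-012", "0.8.2", "not stated"),
]


def platform_of(host):
    """Guess the platform from the hostname prefix (assumption, see lead-in)."""
    if host.startswith("talos-linux"):
        return "linux"
    if host.startswith("t-snow"):
        return "osx"
    return "windows"


def summarize(observations):
    """Map (platform, pip version) to the set of outcomes seen for it."""
    summary = defaultdict(set)
    for host, pip_version, outcome in observations:
        summary[(platform_of(host), pip_version)].add(outcome)
    return dict(summary)


if __name__ == "__main__":
    for key, outcomes in sorted(summarize(OBSERVATIONS).items()):
        print(key, sorted(outcomes))
```

On this sample every Linux host with 1.5.5 fails and every Linux host with 0.8.2 works, while OS X works with 1.5.5, so the pip version alone does not explain the OS X case.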

my guess of what is happening (at least in the linux case) is if a host has not been re-imaged lately, /home/cltbld/talos-slave/test/build/venv will continue to exist so we don't need to install pip. If that does not exist, we create new venv + pip and we now use 1.5.5 as per puppet patch above.

So with that, I think the next thing to do is find out if
1) we have been using pip incorrectly all along and it was only caught by 1.5.5
2) 1.5.5 introduced new behavior and we need to update the way we use it
3) 1.5.5 has a bug and we need to stop using it
Flags: needinfo?(jlund)
BTW --download-cache /home/cltbld/talos-slave/test/build/venv/cache probably does nothing for the talos module as talos is not in that cache.

I suspect what is happening is that pip 1.5.5 says that we need 0.0 and /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages/talos meets that requirement so we don't bother updating it to the talos we got from talos.json

from log:
"13:07:27     INFO -    Requirement already satisfied (use --upgrade to upgrade): talos==0.0 from file:///builds/slave/talos-slave/test-pgo/build/talos_repo in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages"
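The suspicion above can be written down as a toy decision function. This is a simplified model of the behaviour being guessed at, not pip's actual code, and the assumption that `--upgrade` forces a reinstall of a local checkout has not been verified against pip 1.5.5:

```python
def should_reinstall(installed_version, candidate_version, upgrade=False):
    """Toy model: decide whether an install request replaces the existing copy.

    installed_version is None when nothing is installed yet.
    """
    if installed_version is None:
        return True  # nothing there, so install
    if upgrade:
        return True  # assumption: --upgrade always replaces a local checkout
    # Without --upgrade, an equal version counts as "already satisfied".
    return installed_version != candidate_version


if __name__ == "__main__":
    # talos is always 0.0, so a newer checkout looks identical to the old one.
    print(should_reinstall("0.0", "0.0"))                # False
    print(should_reinstall("0.0", "0.0", upgrade=True))  # True
```

Under this model a version bump in setup.py and passing --upgrade would both get the new checkout installed; which one is preferable is the open question.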
jlund, so the question then becomes how do we convert the system in place to require a new talos version?  Do I need to upload/publish this?  Will it work differently for the pandas which use talos.zip?

One thought is we release a talos package and somehow upload it to pypi (the internal one).  Then we fix talos.json to require a different talos package version.

In reality I would like to avoid this talos package stuff, it is nice in a few ways, but for developers who want to investigate stuff figuring out which version to use and somehow getting it seems like it creates more problems.

Is there a way we can add a --upgrade to the mozharness script and it would force a fresh talos cache?
Flags: needinfo?(jlund)

Updated

3 years ago
Depends on: 1061850
(In reply to Joel Maher (:jmaher) from comment #227)
> jlund, so the question then becomes how do we convert the system in place to
> require a new talos version?  Do I need to upload/publish this?

suppose it depends on whether we version talos or change mozharness script. I filed 1061850 to track that effort.

> Will it work different for the pandas which use talos.zip?

not sure how that works. will have to look into it

> 
> One thought is we release a talos package and somehow upload it to pypi (the
> internal one).  Then we fix talos.json to require a different talos package
> version.


that sounds like a reasonable solution but I'm assuming we would have a *lot* of versions needing to be uploaded, as we work more in minor revisions than major releases with talos. Don't know if that's really a bad thing.

> In reality I would like to avoid this talos package stuff, it is nice in a
> few ways, but for developers who want to investigate stuff figuring out
> which version to use and somehow getting it seems like it creates more
> problems

true. we do check out talos_repo every time anyway, right? do we really need it as a pip module or can we just 'hg update {rev}' and use talos_repo itself?

> 
> Is there a way we can add a --upgrade to the mozharness script and it would
> force a fresh talos cache?

if we pass --upgrade to our talos pip call:
pip install --download-cache /home/cltbld/talos-slave/test/build/venv/cache --timeout 120 --no-index --find-links http://pypi.pvt.build.mozilla.org/pub --find-links http://pypi.pub.build.mozilla.org/pub /builds/slave/talos-slave/test-pgo/build/talos_repo

will that also upgrade the inner modules to this?

Unpacking ./talos_repo
  Running setup.py (path:/tmp/pip-gtD7Tf-build/setup.py) egg_info for package from file:///builds/slave/talos-slave/test-pgo/build/talos_repo
  Requirement already satisfied (use --upgrade to upgrade): talos==0.0 from file:///builds/slave/talos-slave/test-pgo/build/talos_repo in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages
Requirement already satisfied (use --upgrade to upgrade): PyYAML in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): mozlog==1.5 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): mozcrash==0.9 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): mozdevice==0.26 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): mozfile==1.1 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): mozhttpd==0.5 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): mozinfo==0.7 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): datazilla==1.4 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): moznetwork==0.24 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): mozprocess==0.13 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): mozinstall==1.6 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): httplib2 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Requirement already satisfied (use --upgrade to upgrade): oauth2 in /home/cltbld/talos-slave/test/build/venv/lib/python2.7/site-packages (from talos==0.0)
Cleaning up...
Return code: 0

will have to look at how that works I suppose
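For reference, a sketch of how the harness could assemble that pip call with the extra flag. The function name and keyword arguments are hypothetical and do not reflect how mozharness builds the command; the paths and URLs are the ones quoted above. `--upgrade` and `--no-deps` are existing pip options. `--no-deps` skips dependency handling entirely, so it would leave the already-installed moz* packages alone but would also not install any that are missing.

```python
def talos_pip_command(talos_repo, upgrade=False, no_deps=False):
    """Build the argument list for installing talos from a local checkout."""
    cmd = [
        "pip", "install",
        "--download-cache", "/home/cltbld/talos-slave/test/build/venv/cache",
        "--timeout", "120",
        "--no-index",
        "--find-links", "http://pypi.pvt.build.mozilla.org/pub",
        "--find-links", "http://pypi.pub.build.mozilla.org/pub",
    ]
    if upgrade:
        cmd.append("--upgrade")  # replace an already-installed talos
    if no_deps:
        cmd.append("--no-deps")  # do not touch or install dependencies
    cmd.append(talos_repo)
    return cmd


if __name__ == "__main__":
    print(" ".join(talos_pip_command(
        "/builds/slave/talos-slave/test-pgo/build/talos_repo", upgrade=True)))
```

Whether `--upgrade` on its own would also try to upgrade the pinned dependencies from the --find-links locations still needs to be tested on a loaner.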
Flags: needinfo?(jlund)
Summary: Intermittent talos.PerfConfigurator.ConfigurationError: No definition found for test(s): ['glterrain'] → ta

Updated

3 years ago
Summary: ta → Intermittent talos.PerfConfigurator.ConfigurationError: No definition found for test(s): ['glterrain']
(In reply to Jordan Lund (:jlund) from comment #225)
> my guess of what is happening (at least in the linux case) is if a host has
> not been re-imaged lately, /home/cltbld/talos-slave/test/build/venv will
> continue to exist so we don't need to install pip. If that does not exist,
> we create new venv + pip and we now use 1.5.5 as per puppet patch above.


Afaict this should not have been the case, and is quite confusing seeing your data, given:

http://hg.mozilla.org/build/puppet/rev/9a0ece91bbea
(In reply to Justin Wood (:Callek) from comment #229)
> (In reply to Jordan Lund (:jlund) from comment #225)
> > my guess of what is happening (at least in the linux case) is if a host has
> > not been re-imaged lately, /home/cltbld/talos-slave/test/build/venv will
> > continue to exist so we don't need to install pip. If that does not exist,
> > we create new venv + pip and we now use 1.5.5 as per puppet patch above.
> 
> 
> Afaict this should not have been the case, and is quite confusing seeing
> your data, given:
> 
> http://hg.mozilla.org/build/puppet/rev/9a0ece91bbea

callek, not sure what we should be expecting? should we not have 1.5.5 yet? callek, jmaher: could we meet about this bug? I'd like to help move this along but I think we'd benefit from a collective understanding of state
Flags: needinfo?(jmaher)
Flags: needinfo?(bugspam.Callek)
I am going to investigate (in the near future) how we could version talos vs adding --upgrade.  Also we need to figure out how to only use the venv talos or only the checked out talos.  It seems we have two versions.
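One way to see which copy a given interpreter would actually pick up is to ask the import system where the module resolves from. A minimal sketch follows; it uses the modern `importlib.util.find_spec`, which is not available on the Python 2.7.3 these slaves run, and both function names are made up for illustration:

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


def classify_origin(origin, venv_root, checkout_root):
    """Label an origin path as coming from the venv, the checkout, or elsewhere."""
    if origin is None:
        return "missing"
    if origin.startswith(venv_root):
        return "venv"
    if origin.startswith(checkout_root):
        return "checkout"
    return "other"


if __name__ == "__main__":
    origin = module_origin("talos")
    print(origin, classify_origin(
        origin,
        "/home/cltbld/talos-slave/test/build/venv",
        "/builds/slave/talos-slave/test/build/talos_repo",
    ))
```

Run inside the activated venv, this would show whether 'import talos' resolves to site-packages or to the talos_repo checkout.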
Flags: needinfo?(jmaher)
Flags: needinfo?(bugspam.Callek)
met with jmaher and the result was comment 234. I've disabled talos-linux64-ix-008 and talos-linux64-ix-004 in the meantime to buy us some time and create less of a headache for sheriffs.
Depends on: 1112773
No longer blocks: 977306
No longer blocks: 974448
No longer blocks: 877667
No longer blocks: 937143
No longer blocks: 895422
Inactive; closing (see bug 1180138).
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WORKSFORME