mozharness talos tpn busted on cedar on Linux and Windows: "Unable to proceed with missing counter 'tp5n_%cpu'"

RESOLVED FIXED

Status

--
major
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: emorley, Unassigned)

Tracking

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

6 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=16493747&tree=Cedar
https://tbpl.mozilla.org/php/getParsedLog.php?id=16492196&tree=Cedar
https://tbpl.mozilla.org/php/getParsedLog.php?id=16492816&tree=Cedar
https://tbpl.mozilla.org/php/getParsedLog.php?id=16492755&tree=Cedar

eg:

{
09:45:18     INFO -  NOISE: Outputting talos results => {'results_urls': ['http://graphs.mozilla.org/server/collect.cgi'], 'datazilla_urls': ['https://datazilla.mozilla.org/talos']}
09:45:18     INFO -  DEBUG: Working with test: tp5n
09:45:18     INFO -  Generating results file: tp5n:
09:45:18     INFO -  		Started Fri, 26 Oct 2012 09:45:18
09:45:18     INFO -  Generating results file: tp5n:
09:45:18     INFO -  		Stopped Fri, 26 Oct 2012 09:45:18
09:45:18     INFO -  No results collected for: tp5n_%cpu:
09:45:18     INFO -  		Error Fri, 26 Oct 2012 09:45:18
09:45:18     INFO -  DEBUG: Working with test: tp5n
09:45:18     INFO -  Generating results file: tp5n:
09:45:18     INFO -  		Started Fri, 26 Oct 2012 09:45:18
09:45:18     INFO -  Generating results file: tp5n:
09:45:18     INFO -  		Stopped Fri, 26 Oct 2012 09:45:18
09:45:18     INFO -  No results collected for: tp5n_%cpu:
09:45:18     INFO -  		Error Fri, 26 Oct 2012 09:45:18
09:45:18     INFO -  FAIL: Unable to proceed with missing counter 'tp5n_%cpu'
09:45:18    ERROR -  Traceback (most recent call last):
09:45:18     INFO -    File "c:\talos-slave\test\build\venv\Scripts\talos-script.py", line 9, in <module>
09:45:18     INFO -      load_entry_point('talos==0.0', 'console_scripts', 'talos')()
09:45:18     INFO -    File "c:\talos-slave\test\build\venv\lib\site-packages\talos\run_tests.py", line 300, in main
09:45:18     INFO -      run_tests(parser)
09:45:18     INFO -    File "c:\talos-slave\test\build\venv\lib\site-packages\talos\run_tests.py", line 276, in run_tests
09:45:18     INFO -      talos_results.output(results_urls, **results_options)
09:45:18     INFO -    File "c:\talos-slave\test\build\venv\lib\site-packages\talos\results.py", line 89, in output
09:45:18     INFO -      raise e
09:45:18 CRITICAL -  talos.utils.talosError: "Unable to proceed with missing counter 'tp5n_%cpu'"
}
That is a new one for us. I have seen this fail on tp5n_xperf_main_startup_netio and tp5n_xres, but %cpu is a new one!
(Reporter)

Comment 2

6 years ago
talos... the gift that keeps on giving!
Blocks: 806123

Comment 3

6 years ago
It looks like this isn't mozharness specific?
Is there something I can do to help this along?
Well, failing once every few days, spread across every talos job that runs on every tree, isn't mozharness specific, but failing every single time seems to be.

Comment 5

6 years ago
This may be related to bug 795531 on the mozharness side.

Comment 6

6 years ago
We have shut off making counters mandatory for the time being.  In light of that, should we keep this bug open?

Comment 7

6 years ago
We need to update the talos + other packages to pick up the workaround in comment 6.

Updated

6 years ago
Depends on: 823306

Comment 8

6 years ago
As Aki says, we do need to update talos + deps to get better here (bug 823306). The current revision, 0e9224d7bc95, raises an error if we fail to collect counters: http://hg.mozilla.org/build/talos/diff/524c6ff1736b/talos/output.py#l208 . However, because missing counters happen all the time and caused several intermittent bugs, we subsequently disabled this error: http://hg.mozilla.org/build/talos/file/71f7f2ed08a7/talos/output.py#l208 . See https://bugzilla.mozilla.org/show_bug.cgi?id=812315 . Unfortunately, this has merely transformed the failure into a different intermittent: bug 812729.

So while we should update the packages and get on parity with what we use to test m-c, we're still going to have the graphserver error.  An alternative is to not care about graphserver for mozharness talos and go straight to datazilla.
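
For anyone following along, here is a minimal sketch of the difference between the strict check that produces the error in the description and the relaxed behaviour from comment 6. This is not the actual talos code: `TalosError`, `check_counters` and the `mandatory` flag are hypothetical names used only for illustration.

```python
import logging


class TalosError(Exception):
    """Hypothetical stand-in for talos.utils.talosError."""


def check_counters(expected, collected, mandatory=True):
    """Return the expected counters that have no collected values.

    mandatory=True  -> raise on the first missing counter (strict check).
    mandatory=False -> only log each missing counter (relaxed check).
    """
    missing = [name for name in expected if not collected.get(name)]
    if missing and mandatory:
        raise TalosError(
            "Unable to proceed with missing counter '%s'" % missing[0]
        )
    for name in missing:
        logging.warning("No results collected for: %s", name)
    return missing
```

With `mandatory=False` the run carries on and simply reports which counters were absent, which matches the situation described above: the hard failure goes away, but whatever consumes the results still sees incomplete data.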
I would like to say mozharness should only care about datazilla, but that would be putting the cart before the horse.

Technically we could do it and the UI will support it fine.  I still think we have about 2 months before all tests and data are stable, organized and validated in the datazilla UI.  Then we would need to hook up the regression emailer to it, or find a way to report failures.  That is one of the final steps, but until we have that at least well under way we shouldn't be talking about reporting to datazilla only.
(In reply to Joel Maher (:jmaher) from comment #9)
> I would like to say mozharness should only care about datazilla, but that
> would be putting the cart before the horse.
> 
> Technically we could do it and the UI will support it fine.  I still think
> we have about 2 months before all tests and data are stable, organized and
> validated in the datazilla UI.  Then we would need to hook up the regression
> emailer to it, or find a way to report failures.  That is one of the final
> steps, but until we have that at least well under way we shouldn't be
> talking about reporting to datazilla only.

Two months to get into datazilla - huh - is there anything we can do in the meanwhile? I'm asking because it's unclear (at least on my quick read) what the next steps are here, and whether this really does block bug#713055 (talos-on-mozharness).
The next steps are for somebody who can update and debug this to figure out why we are failing to collect counters on mozharness only. The same test harness works just fine on buildbot/tinderbox, so something is amiss in the land of mozharness.
I am not seeing this error on Linux, only on Windows. For Linux we are timing out on the tp test, and looking at the logs there is this magical 20 minute void in the timestamps. For Windows, it would be nice if we could update mozharness to use the latest talos bits.
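
A rough sketch of how one might locate that kind of timestamp void in a log. It assumes each line starts with an HH:MM:SS stamp, as in the excerpt in the description; `find_gaps` and the 20 minute default threshold are hypothetical and not part of mozharness or talos.

```python
import re
from datetime import datetime, timedelta

# Leading HH:MM:SS stamp, as seen in the log excerpt in the description.
STAMP = re.compile(r"^(\d{2}:\d{2}:\d{2})\s")


def find_gaps(lines, threshold=timedelta(minutes=20)):
    """Yield (previous_line, next_line, gap) for each silence >= threshold."""
    prev_time = prev_line = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a timestamp
        current = datetime.strptime(match.group(1), "%H:%M:%S")
        if prev_time is not None:
            gap = current - prev_time
            if gap < timedelta(0):
                # Assumption: the log crossed midnight at most once here.
                gap += timedelta(days=1)
            if gap >= threshold:
                yield prev_line, line, gap
        prev_time, prev_line = current, line
```

Running it over a downloaded log, e.g. `for before, after, gap in find_gaps(open("talos.log")): print(gap, before.rstrip())` (the filename is a placeholder), would show the last line logged before each long silence, which is usually the place to start looking for the hang.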
Unless I'm mistaken, we're no longer running tp5n and we can WONTFIX this bug!

If we do, however, we should open a new one for tp5o being busted across all platforms :\
All we did was adjust the tp5 pageset and call it tp5o. This isn't a WONTFIX; the same harness is being run. If you feel the issues are different, then go ahead and WONTFIX. I do know that tp5o runs great on all our production platforms.

Comment 15

6 years ago
Apparently, this is fixed in bug 887479.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED