Intermittent "talosError: Unable to proceed with missing counter 'tp5n_main_startup_fileio'" or "talosError: Unable to proceed with missing counter 'tp5n_main_shutdown_fileio'"

RESOLVED WORKSFORME

Status

defect
7 years ago
7 years ago

People

(Reporter: philor, Unassigned)

Tracking

({intermittent-failure})

Trunk
x86
Windows 7

Firefox Tracking Flags

(firefox18 unaffected, firefox19 affected)

Details

Attachments

(1 attachment)

+++ This bug was initially created as a clone of Bug #802289 +++

https://tbpl.mozilla.org/php/getParsedLog.php?id=16255092&tree=Mozilla-Inbound
Rev3 WINNT 6.1 mozilla-inbound talos xperf on 2012-10-18 20:18:51 PDT for push ee0b7a687708
slave: talos-r3-w7-094

No results collected for: tp5n_main_startup_fileio: 
		Error Thu, 18 Oct 2012 20:27:17

FAIL: Unable to proceed with missing counter 'tp5n_main_startup_fileio'
Traceback (most recent call last):
  File "run_tests.py", line 303, in ?
    main()
  File "run_tests.py", line 300, in main
    run_tests(parser)
  File "run_tests.py", line 276, in run_tests
    talos_results.output(results_urls, **results_options)
  File "C:\talos-slave\talos-data\talos\results.py", line 89, in output
    raise e
utils.talosError: "Unable to proceed with missing counter 'tp5n_main_startup_fileio'"

Let me know if you want pasted logs to track frequency, otherwise I'll just star these as "t" and move on.
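For tracking frequency from pasted logs, a minimal Python sketch along these lines might help. It only assumes the error wording shown in the traceback above; the function names `missing_counters` and `frequency` are hypothetical and not part of talos or TBPL.

```python
import re
from collections import Counter

# Matches the error text seen in the log excerpt above and captures the
# counter name between the single quotes.
MISSING_COUNTER_RE = re.compile(
    r"Unable to proceed with missing counter '([^']+)'"
)


def missing_counters(log_text):
    """Return the distinct counter names reported missing in one log."""
    # The same counter appears on both the FAIL line and the talosError
    # line, so duplicates within a single log are collapsed.
    return sorted(set(MISSING_COUNTER_RE.findall(log_text)))


def frequency(log_texts):
    """Count, per counter name, how many logs reported it missing."""
    counts = Counter()
    for text in log_texts:
        counts.update(missing_counters(text))
    return counts
```

Counting each counter once per log, rather than once per matching line, keeps the tally aligned with the number of failed runs.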
taras: do we want to keep tracking this counter?  I would think so; it is just odd that we don't collect data all the time.
Summary: Intermittent utils.talosError: "Unable to proceed with missing counter 'tp5n_main_startup_fileio'" → Intermittent "talosError: Unable to proceed with missing counter 'tp5n_main_startup_fileio'"
https://tbpl.mozilla.org/php/getParsedLog.php?id=16458380&tree=Mozilla-Inbound

Is this affected by code, as in a push could cause it to break?
Taras: comment 10. These are a pain in the ass, they make desktop tests as annoying to star as Android, which is right on the edge of making all of us who star take up a more pleasant hobby like punching kittens.

Are we going to keep these things we don't always collect? Are we going to add debugging code to tell us why we don't always collect them? Are we going to TAKE OUT THE GODDAMN CODE THAT TURNS THE RUN RED AND THROWS AWAY ALL THE OTHER RESULTS when we don't collect something?
Flags: needinfo?(taras.mozilla)
(In reply to Phil Ringnalda (:philor) from comment #73)
> Taras: comment 10. These are a pain in the ass, they make desktop tests as
> annoying to star as Android, which is right on the edge of making all of us
> who star take up a more pleasant hobby like punching kittens.
> 
> Are we going to keep these things we don't always collect? Are we going to
> add debugging code to tell us why we don't always collect them? Are we going
> to TAKE OUT THE GODDAMN CODE THAT TURNS THE RUN RED AND THROWS AWAY ALL THE
> OTHER RESULTS when we don't collect something?

We need this test if we are to improve startup time. However, this is clearly not working out as a rigid test, so let's make the test not fail until we figure out what fails.
Joel, can you undo the hard failure in this test until we have confidence this works?
Flags: needinfo?(taras.mozilla) needinfo?(taras.mozilla) → needinfo+
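The soft failure requested above could look roughly like the following Python sketch. This is not the actual talos `results.py` code: `check_counters`, `TalosError`, the `strict` flag and the shape of `collected` are all hypothetical, and only the wording of the error message is taken from this bug.

```python
import logging


class TalosError(Exception):
    """Hypothetical stand-in for utils.talosError."""


def check_counters(collected, expected, strict=False):
    """Return the names of expected counters that have no collected values.

    collected: dict mapping counter name -> list of values (assumed shape).
    expected:  iterable of counter names the test should have produced.
    strict:    if True, raise on the first missing counter (the hard
               failure discussed in this bug); if False, log a warning and
               continue so the remaining results can still be reported.
    """
    missing = []
    for name in expected:
        # A counter counts as missing if it is absent or has no values.
        if not collected.get(name):
            message = "Unable to proceed with missing counter '%s'" % name
            if strict:
                raise TalosError(message)
            logging.warning(message)
            missing.append(name)
    return missing
```

With `strict=False` the caller gets the list of missing counters back and can decide whether to mark the run as a warning instead of turning it red and discarding the other results.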
(In reply to Taras Glek (:taras) from comment #78)
> (In reply to Phil Ringnalda (:philor) from comment #73)
> > Taras: comment 10. These are a pain in the ass, they make desktop tests as
> > annoying to star as Android, which is right on the edge of making all of us
> > who star take up a more pleasant hobby like punching kittens.
> > 
> > Are we going to keep these things we don't always collect? Are we going to
> > add debugging code to tell us why we don't always collect them? Are we going
> > to TAKE OUT THE GODDAMN CODE THAT TURNS THE RUN RED AND THROWS AWAY ALL THE
> > OTHER RESULTS when we don't collect something?
> 
> we need this test if we are to improve startup time. However this is clearly
> not working out as a rigid test so lets make the test not fail until we
> figure out what fails.
> Joel, can you undo the hard failure in this test until we have confidence
> this works?
Flags: needinfo?(jmaher)
there is no point in us collecting data if we are not collecting it all the time.  We need to make a decision to allocate resources and debug the problem or we need to stop collecting data we don't care about. 

Obviously we care about it, :taras, can you have somebody from your team look at this and figure out why we are not collecting this data all the time?

I am booked up for panda board work for the foreseeable future.
Flags: needinfo?(jmaher)
Duplicate of this bug: 807454
(In reply to Joel Maher (:jmaher) from comment #117)
> there is no point in us collecting data if we are not collecting it all the
> time.

Why not? We coalesce away talos jobs all the time, maybe we run xperf on this push, maybe we don't. Why is it pointless to run it at all if it sometimes misses a number, but it's not pointless to run it when it frequently misses a push?

https://tbpl.mozilla.org/php/getParsedLog.php?id=16660020&tree=Mozilla-Inbound
(In reply to Joel Maher (:jmaher) from comment #117)
> there is no point in us collecting data if we are not collecting it all the
> time.  We need to make a decision to allocate resources and debug the
> problem or we need to stop collecting data we don't care about. 
> 
> Obviously we care about it, :taras, can you have somebody from your team
> look at this and figure out why we are not collecting this data all the time?

I'm happy to allocate resources to this, but we won't be able to do this for another 2 months. So let's stop collecting this in the meantime.


(In reply to Phil Ringnalda (:philor) from comment #119)
> (In reply to Joel Maher (:jmaher) from comment #117)
> > there is no point in us collecting data if we are not collecting it all the
> > time.
> 
> Why not? We coalesce away talos jobs all the time, maybe we run xperf on
> this push, maybe we don't. Why is it pointless to run it at all if it
> sometimes misses a number, but it's not pointless to run it when it
> frequently misses a push?
> 

Coalescing of talos numbers is a problem, but that doesn't mean we should introduce new problematic measures.
Stop collecting per comment 122.
Assignee: nobody → bmo
Status: NEW → ASSIGNED
Attachment #678068 - Flags: review?(jmaher)
Comment on attachment 678068 [details] [diff] [review]
Stop collecting tp5n_main_startup_fileio

Review of attachment 678068 [details] [diff] [review]:
-----------------------------------------------------------------

thanks for doing this.  I need to work on talos for a day to get a whole queue of stuff landed.
Attachment #678068 - Flags: review?(jmaher) → review+
Np; thank you :-)

https://hg.mozilla.org/build/talos/rev/ec4d0dc9513e

Leaving open for talos deploy.
Whiteboard: [orange] → [orange][needs talos deploy]
Depends on: 811361
Sigh.

https://tbpl.mozilla.org/php/getParsedLog.php?id=17030615&tree=Mozilla-Inbound

{
FAIL: Unable to proceed with missing counter 'tp5n_main_shutdown_fileio'
Traceback (most recent call last):
  File "run_tests.py", line 311, in ?
    main()
  File "run_tests.py", line 308, in main
    run_tests(parser)
  File "run_tests.py", line 284, in run_tests
    talos_results.output(results_urls, **results_options)
  File "C:\talos-slave\talos-data\talos\results.py", line 89, in output
    raise e
utils.talosError: "Unable to proceed with missing counter 'tp5n_main_shutdown_fileio'"
}
Summary: Intermittent "talosError: Unable to proceed with missing counter 'tp5n_main_startup_fileio'" → Intermittent "talosError: Unable to proceed with missing counter 'tp5n_main_startup_fileio'" or "talosError: Unable to proceed with missing counter 'tp5n_main_shutdown_fileio'"
First patch in production, though we need to now deal with tp5n_main_shutdown_fileio too.
Whiteboard: [orange][needs talos deploy] → [orange]
Depends on: 812315
it looks like all our failures occur on:
talos-r3-w7-080 -> talos-r3-w7-099

My question to releng is what is the range of w7 machines?  do we really have: 
talos-r2-w7-{00|79} ?
Could this bug be related to missing mozprofilerprobe.mof?
I would not be surprised if the ref image had it missing.

The range of win7 slaves is talos-r3-w7-[001-104] (slaves 1, 2, 3 & 10 are for staging).
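To check whether the failures really cluster on slaves 080 to 099, the failing logs could be tallied by slave with a sketch like the one below. It assumes each log contains a `slave: talos-r3-w7-NNN` line, as in the log excerpt at the top of this bug; the function names are hypothetical, and the range and staging numbers are the ones quoted in the comment above.

```python
import re
from collections import Counter

# Matches a line such as "slave: talos-r3-w7-094" and captures the number.
SLAVE_RE = re.compile(r"^slave: talos-r3-w7-(\d{3})\s*$", re.MULTILINE)

# Slave numbers as stated above: 001-104, with 1, 2, 3 and 10 on staging.
ALL_SLAVES = range(1, 105)
STAGING = {1, 2, 3, 10}


def slave_number(log_text):
    """Return the numeric suffix of the slave named in a log, or None."""
    match = SLAVE_RE.search(log_text)
    return int(match.group(1)) if match else None


def tally_failures(log_texts):
    """Count failing logs per slave number, skipping logs with no slave line."""
    counts = Counter()
    for text in log_texts:
        number = slave_number(text)
        if number is not None:
            counts[number] += 1
    return counts


def outside_range(counts, low=80, high=99):
    """Return failing slave numbers that fall outside low..high inclusive."""
    return sorted(n for n in counts if not low <= n <= high)
```

An empty result from `outside_range` would support the observation that only the 080 to 099 machines are affected, which in turn would point at a difference in how that batch was imaged.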
Depends on: 813239
Backing out the fix here (talos ec4d0dc9513e), given that:
a) It means we just hit bug  instead
b) Bug should fix the problem for us

Backout:
https://hg.mozilla.org/build/talos/rev/751345a46752
Fail.

> a) It means we just hit bug  instead
Bug 812729

> b) Bug should fix the problem for us
Bug 813239
Unassigning since nothing to do here, since real fix is bug 813239.
Assignee: bmo → nobody
Status: ASSIGNED → NEW
Whiteboard: [orange]
when can we close this bug?  Want to wait 2 weeks to make sure it doesn't happen anymore?
(In reply to Joel Maher (:jmaher) from comment #278)
> when can we close this bug?  Want to wait 2 weeks to make sure it doesn't
> happen anymore?

Sounds good.
Resolving WFM keyword:intermittent-failure bugs last modified >3 months ago, whose whiteboard contains none of:
{random,disabled,marked,fuzzy,todo,fails,failing,annotated,time-bomb,leave open}

There will inevitably be some false positives; for that (and the bugspam) I apologise. Filter on orangewfm.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME