Add ability to Ouija slaves graph to display acceptable failure rate for a given slave type

RESOLVED FIXED

Status

Testing
General
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: jmaher, Unassigned)

Tracking

Trunk
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [good first bug][mentor=dminor][lang=python])

Attachments

(1 attachment, 1 obsolete attachment)

There are many instances where a specific slave will run into hardware or OS configuration issues and fail more frequently than its peers.  This is easy to detect when looking back at history because it will show an abnormally high number of failures.

It would be nice to have another column in the slaves table that would show the max % of acceptable failures.

This should be done by looking at the machine name, for example 't-w864-ix-093'; you can determine the type by stripping the trailing -[0-9]+, which gives 't-w864-ix'.  There are many families of machines, and by looking at the total number of jobs and the total number of failures (excluding retries) we could determine a % failure for each platform.
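The name-to-type rule described above could be sketched in Python as follows. The function name `slave_type` is hypothetical and is not taken from the Ouija code base; only the `-[0-9]+` suffix rule comes from this bug.

```python
import re

# Matches the trailing "-<digits>" that identifies one machine within
# its family, e.g. the "-093" in "t-w864-ix-093".
_TRAILING_NUMBER = re.compile(r"-[0-9]+$")


def slave_type(slave_name):
    """Return the machine family for a slave name (hypothetical helper)."""
    return _TRAILING_NUMBER.sub("", slave_name)


print(slave_type("t-w864-ix-093"))
```

Names that carry no numeric suffix are returned unchanged, so the helper is safe to apply to every row of the slaves table.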

Comment 1

5 years ago
But the slaves table contains data related to slaves, not platforms. How should the acceptable failure rate be displayed in this case? I see only one solution: calculate it for a given platform and then show it for every machine of that type. Is that correct? What is the expected position of this column? After "Passes" and before "Total"?
Pretty much right on.  If you have all the data from your query for the slaves, you can parse the slave name into platform.  Given that platform, you can calculate the platform failure rate (PFR).  Now when displaying the specific slave information, it would be nice to know the total failure rate of that specific slave (numfailures/numjobs).  As for comparing this to the platform in general, there are a couple thoughts I have:
1) it would be redundant to have a column with the PFR, but it would be nice to have
2) calculating the PFR will be flawed if we have a few slaves with abnormally high failure rates. 

For 1, I think we can live with it, it might look like:
name,      failures, retries, infra, total, % failure, expected % failure
tegra-314, 21,       20,      9,     500,   10.0%,     3.14%
tegra-157, 3,        6,       1,     500,   2.0%,      3.14%


In this case, we can quickly determine that tegra-314 should be looked into and tegra-157 is operating normally.
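A rough sketch of how the two extra columns might be computed is shown below. It assumes the "% failure" figures in the example count failures, retries and infra together, since (21 + 20 + 9) / 500 = 10.0% and (3 + 6 + 1) / 500 = 2.0%. The row layout and the function names are hypothetical, and the expected rate here is computed from these two rows only, so it will not reproduce the illustrative 3.14%.

```python
import re
from collections import defaultdict


def platform_of(name):
    # Strip the trailing "-<digits>" to get the platform, as described in the bug.
    return re.sub(r"-[0-9]+$", "", name)


def percent(bad, total):
    # Guard against slaves that have not run any jobs yet.
    return 100.0 * bad / total if total else 0.0


def with_expected_rates(rows):
    """Add per-slave and per-platform failure percentages to each row."""
    bad_by_platform = defaultdict(int)
    total_by_platform = defaultdict(int)
    for row in rows:
        platform = platform_of(row["name"])
        bad_by_platform[platform] += row["failures"] + row["retries"] + row["infra"]
        total_by_platform[platform] += row["total"]

    report = []
    for row in rows:
        platform = platform_of(row["name"])
        bad = row["failures"] + row["retries"] + row["infra"]
        report.append({
            "name": row["name"],
            "pct_failure": percent(bad, row["total"]),
            "expected_pct_failure": percent(bad_by_platform[platform],
                                            total_by_platform[platform]),
        })
    return report


rows = [
    {"name": "tegra-314", "failures": 21, "retries": 20, "infra": 9, "total": 500},
    {"name": "tegra-157", "failures": 3, "retries": 6, "infra": 1, "total": 500},
]
report = with_expected_rates(rows)
for entry in report:
    flag = "check" if entry["pct_failure"] > entry["expected_pct_failure"] else "ok"
    print(entry["name"], entry["pct_failure"], entry["expected_pct_failure"], flag)
```

With only these two rows the platform rate is 6.0%, so tegra-314 (10.0%) is flagged and tegra-157 (2.0%) is not.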


Bonus points for making this a sortable table (I recall a javascript library to do this easily) on any given column :)

Comment 3

5 years ago
A few questions:
1. How do I calculate the failure rate for a platform? PFR = all failures (on all slaves of the given platform) / all runs (failures + retries + passes on all slaves of the given platform)? Or all failures / (all runs - retries)?
2. What is infra? This column is not present at the moment; should I add it too?
3. You didn't mention how to mitigate point 2 in your reply (abnormally high failure rates on several slaves that skew the overall statistics used for the PFR calculation).
1) PFR = (sum of all failures on all slaves for a given platform) / (all runs on all slaves for the given platform).
  I am not sure if we should exclude retries or not.  Retries usually indicate a failure and help point out problems, but if the problem is with a build/test/harness, then we will retry a lot and rack up failures on a lot of machines.  For now let's exclude them, bonus points to toggle that ;)

2) Infra is infrastructure-related failures.  Specifically things like DNS failures, power outages, etc.  These are rare enough, but sometimes include hardware failures on the specific machine under test.  These should be denoted as a different failure type.

3) I don't have a solution to mitigate high failure rates propping up the overall statistics.  For now we can live with it, although I am open to more suggestions.

Thanks for making sure you understand this bug and do the right thing.  Looking forward to your patch.

Comment 5

5 years ago
1) Should there be another checkbox to toggle that?
2) How can I recognize such failures? I looked at the values stored in testtype, result and buildtype in the database and found nothing resembling infra failures.
3) Perhaps I can dig into statistics, but it has been a long time since I studied it at university :)
ok, I can find all the colors here: http://54.215.155.53/data/results?platform=android4.0

test failure: orange
infra: red
retry: blue
passing: green


If making a checkbox to include/exclude retries is doable, I vote for that.

Let me know if that helps at all.

Comment 7

5 years ago
OK, I added infra results.
The failure rate is now calculated as (num of fails * 100) / (num of fails + num of infra + num of passes).
The server side is also ready to calculate the failure rate including retries, as (num of fails * 100) / (num of fails + num of retries + num of infra + num of passes).
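
The two formulas above could be expressed as a single function with a toggle. This is only a sketch: the function name and its parameters are hypothetical and do not come from the attached patch.

```python
def failure_rate(fails, passes, infra, retries=0, include_retries=False):
    """Failure percentage following the two formulas in this comment.

    Retries are added to the denominator only when include_retries is set,
    which mirrors the 'including retries' checkbox discussed in this bug.
    """
    denominator = fails + infra + passes
    if include_retries:
        denominator += retries
    if denominator == 0:
        # No counted runs yet; report 0 rather than dividing by zero.
        return 0.0
    return (fails * 100.0) / denominator


# Illustrative counts borrowed from the tegra-314 example row earlier in
# this bug (21 failures, 20 retries, 9 infra, 500 total, hence 450 passes).
print(failure_rate(21, 450, 9))
print(failure_rate(21, 450, 9, retries=20, include_retries=True))
```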

I can submit a patch for that right now.
I need a bit more time to add the checkbox for 'including retries' in the failure rate calculations.

Could we move sorting into a separate issue?
Let's do the sortable tables in a different bug.  I have filed bug 919960 to track that work.

Comment 9

5 years ago
Created attachment 811673 [details] [diff] [review]
0001-show-failure-rates-switch-between-failure-rates.patch
Attachment #811673 - Flags: review?(dminor)

Comment 10

5 years ago
Comment on attachment 811673 [details] [diff] [review]
0001-show-failure-rates-switch-between-failure-rates.patch

Review of attachment 811673 [details] [diff] [review]:
-----------------------------------------------------------------

The changes look good, but unfortunately the patch you attached does not apply cleanly to the latest ouija changes from github and needs to be rebased.

In case you haven't done this before, merging from the github ouija master to your local master and then running 'git rebase master' from your local branch is probably the easiest way to do this.

Once the patch is updated, I'll be happy to take another look at it. Thanks!
Attachment #811673 - Flags: review?(dminor) → review-

Comment 11

5 years ago
Created attachment 812245 [details] [diff] [review]
resolved merge conflict

Thanks, Dan!
I resolved the merge conflict.
Attachment #811673 - Attachment is obsolete: true
Attachment #812245 - Flags: review?(dminor)

Comment 12

5 years ago
Comment on attachment 812245 [details] [diff] [review]
resolved merge conflict

Review of attachment 812245 [details] [diff] [review]:
-----------------------------------------------------------------

Great work, thanks!
Attachment #812245 - Flags: review?(dminor) → review+

Comment 13

5 years ago
Committed here: https://github.com/dminor/ouija/commit/b0889c6390f92eb53f8b2b8aeb1f175e54885be7 and in production.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED