Closed Bug 1024934 Opened 10 years ago Closed 10 years ago

Ouija slave_failures.html is missing many jobs for each device

Categories

(Testing :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: dminor)

References

Details

Go to:
http://54.215.155.53/slave_failures.html

The date range defaulted to:
2014-06-06 -> 2014-06-13

Pick pretty much any machine in the list - and look at the total job count.

eg:
talos-mtnlion-r5-073 - total: 26

Follow the link to that machine's health page:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-mtnlion-r5-073

And note that in the "last 100 jobs" view, the oldest job there is ~11th June, ie: between the 11th and now at least 100 jobs were completed - far more than the 26 Ouija reports for the whole range (and that's not counting the ones we can't see there from 2014-06-06 -> 2014-06-11).

Another example:
Sort by the "retries" column (descending).

Inevitably a panda will surface to the top:
eg panda-0283 - retries: 5

On the health page:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0283
...I can see at least 15 retries, and that's only in the last 100 jobs (which covers just a couple of days).

Also note that chronic cases of bad tegras (where almost all of the last 100 jobs were retries) - eg:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=tegra-038

...don't even show in the list, even though they have had more than 5 jobs in the timeframe:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=tegra-038

My use case here is trying to find bad tegras/pandas to stop them chewing through jobs (current workflow is just clicking through the blues on TBPL) - but at the moment data is missing, so this isn't possible.
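
For reference, the kind of check I'm doing by hand (and would love this report to support) is roughly the sketch below - this is just an illustration, assuming we have the list of result strings from a slave's "last 100 jobs" view; it isn't real slave_health API code:

# Sketch only: job_results is assumed to be a list of result strings
# (eg "success", "retry") taken from a slave's "last 100 jobs" view.
def is_chronic_retrier(job_results, threshold=0.5):
    """Flag a slave whose recent jobs were mostly retries."""
    if not job_results:
        return False
    retries = sum(1 for result in job_results if result == "retry")
    return retries / float(len(job_results)) >= threshold

# A tegra where almost all of the last 100 jobs were retries:
print(is_chronic_retrier(["retry"] * 90 + ["success"] * 10))  # True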

Thanks :-)
Blocks: Ouija
At the moment the only branch we're pulling from is mozilla-central and we're only updating the database once per day. To make this useful to you we'd have to start pulling from more branches and doing that more often, which is easy enough, but will increase load.

Before I do that, jmaher, is there enough overlap in what we collect that we can share data between ouija and your talos sheriffing tool?

We could also maybe use a different source for the slave_failures (discover slave names from the usual sources, and then use the slave_health reports directly).
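
To make the trade-off concrete, something like the sketch below is what I have in mind - this isn't the current Ouija updater; the branch list, the interval, and the fetch_and_store_jobs() name are all assumptions:

# Hypothetical sketch of a more frequent, multi-branch update pass.
# fetch_and_store_jobs() is an assumed name standing in for whatever the
# real per-branch pull/insert code does.
import time

BRANCHES = ["mozilla-central", "mozilla-inbound", "b2g-inbound", "fx-team"]
UPDATE_INTERVAL = 4 * 60 * 60  # eg every 4 hours instead of once a day

def update_all(fetch_and_store_jobs):
    for branch in BRANCHES:
        # Load grows roughly linearly with the number of branches pulled.
        fetch_and_store_jobs(branch)

def main_loop(fetch_and_store_jobs):
    while True:
        update_all(fetch_and_store_jobs)
        time.sleep(UPDATE_INTERVAL)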
Flags: needinfo?(jmaher)
The Talos alert manager doesn't do much with tinderbox logs; we source from graph server and hg for revisions.

This machine doesn't support a lot of load; we could increase the branches we pull from and keep updates at once a day as a starting position.
Flags: needinfo?(jmaher)
Ah, that makes sense. The main problem I was having is that mozilla-central is a low-activity repo and as such doesn't give a large enough sample size (particularly since I'm most interested in catching machines that have gone bad only within the last day or two).

If we could also monitor {mozilla-inbound,b2g-inbound,fx-team} that would be great - or if that's problematic, perhaps switching from mozilla-central to mozilla-inbound would be an improvement for the short term at least?

Thanks! :-)
Let's keep in mind the reason we use m-c for android status - it is the most stable branch, so we wouldn't have all the noise of backouts, etc.

This is a great idea to get a larger picture view, but for the weekly android reports, we should keep it pinned at m-c.
(In reply to Joel Maher (:jmaher) from comment #4)
> Let's keep in mind the reason we use m-c for android status - it is the most
> stable branch, so we wouldn't have all the noise of backouts, etc.
> 
> This is a great idea to get a larger picture view, but for the weekly
> android reports, we should keep it pinned at m-c.

It looks like the failures query specifies branch = 'mozilla-central', but this is something we'll have to be careful of in the future.
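
For illustration, here is a minimal sketch of how the branch filter could be parameterised so the slave_failures view covers all monitored branches while the weekly android report stays pinned to mozilla-central - the table and column names here are assumptions, not the actual Ouija schema:

# Sketch only: "testjobs", "slave", "result" and "branch" are assumed names.
def failures_query(branches=("mozilla-central",)):
    """Return (sql, params) restricted to the given branches."""
    placeholders = ", ".join(["%s"] * len(branches))
    sql = ("SELECT slave, result, count(*) AS jobs "
           "FROM testjobs "
           "WHERE branch IN ({}) "
           "GROUP BY slave, result").format(placeholders)
    return sql, list(branches)

# slave_failures view: all monitored branches.
sql, params = failures_query(("mozilla-central", "mozilla-inbound",
                              "b2g-inbound", "fx-team"))

# Weekly android report: keep it pinned to m-c.
sql_mc, params_mc = failures_query()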
Assignee: nobody → dminor
Status: NEW → ASSIGNED
Pull request here: https://github.com/dminor/ouija/pull/10
Ed, if all went according to plan we should now have a week's worth of data across the four branches. Do you mind taking a quick look and seeing if this gives enough coverage for you to be able to identify bad devices?

Sorry, I realize it's very slow at the moment now that the extra data has been added. If this looks like it could be useful to you, I'll try a small or medium AWS instance and see if that fixes the performance.

Thanks!
Flags: needinfo?(emorley)
The performance was ok TBH, and using it this morning I found 3 machines that were bad and needed disabling, so I think this is good for now. Thank you! :-)
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(emorley)
Resolution: --- → FIXED
(In reply to Ed Morley [:edmorley UTC+0] from comment #8)
> The performance was ok TBH, and using it this morning I found 3 machines
> that were bad and needed disabling, so I think this is good for now. Thank
> you! :-)

Cool, glad to hear it!