Closed Bug 1024934 Opened 10 years ago Closed 10 years ago

Ouija slave_failures.html is missing many jobs for each device

Categories

(Testing :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: dminor)

References

Details

Go to:
http://54.215.155.53/slave_failures.html

The date range defaulted to:
2014-06-06 -> 2014-06-13

Pick pretty much any machine in the list - and look at the total job count.

eg:
talos-mtnlion-r5-073 - total: 26

Follow the link to that machine's health page:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-mtnlion-r5-073

And note that in the "last 100 jobs" view, the oldest job there is ~11th June, ie: between the 11th and now at least 100 jobs were completed - far more than the 26 Ouija reports for the whole range (and that's not counting the ones we can't see there from 2014-06-06 -> 2014-06-11).

Another example:
Sort by the "retries" column (descending).

Inevitably a panda will surface to the top:
eg panda-0283 - retries: 5

On the health page:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=panda-0283
...I can see at least 15 retries, and that's only in the last 100 jobs (which covers just a couple of days).

Also note that chronic cases of bad tegras (where almost all of the last 100 jobs were retries) - eg:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=tegra-038

...don't even show in the list, even though they have had more than 5 jobs in the timeframe:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=tegra-038

My use case here is trying to find bad tegras/pandas to stop them chewing through jobs (current workflow is just clicking through the blues on TBPL) - but at the moment data is missing, so this isn't possible.
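
For reference, the kind of check I'm doing by hand (and would love this report to support) is roughly the sketch below - this is just an illustration, assuming we have the list of result strings from a slave's "last 100 jobs" view; it isn't real slave_health API code:

# Sketch only: job_results is assumed to be a list of result strings
# (eg "success", "retry") taken from a slave's "last 100 jobs" view.
def is_chronic_retrier(job_results, threshold=0.5):
    """Flag a slave whose recent jobs were mostly retries."""
    if not job_results:
        return False
    retries = sum(1 for result in job_results if result == "retry")
    return retries / float(len(job_results)) >= threshold

# A tegra where almost all of the last 100 jobs were retries:
print(is_chronic_retrier(["retry"] * 90 + ["success"] * 10))  # True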

Thanks :-)
Blocks: Ouija
At the moment the only branch we're pulling from is mozilla-central and we're only updating the database once per day. To make this useful to you we'd have to start pulling from more branches and doing that more often, which is easy enough, but will increase load.

Before I do that, jmaher, is there enough overlap in what we collect that we can share data between ouija and your talos sheriffing tool?

We could also maybe use a different source for the slave_failures (discover slave names from the usual sources, and then use the slave_health reports directly).
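
To make the trade-off concrete, something like the sketch below is what I have in mind - this isn't the current Ouija updater; the branch list, the interval, and the fetch_and_store_jobs() name are all assumptions:

# Hypothetical sketch of a more frequent, multi-branch update pass.
# fetch_and_store_jobs() is an assumed name standing in for whatever the
# real per-branch pull/insert code does.
import time

BRANCHES = ["mozilla-central", "mozilla-inbound", "b2g-inbound", "fx-team"]
UPDATE_INTERVAL = 4 * 60 * 60  # eg every 4 hours instead of once a day

def update_all(fetch_and_store_jobs):
    for branch in BRANCHES:
        # Load grows roughly linearly with the number of branches pulled.
        fetch_and_store_jobs(branch)

def main_loop(fetch_and_store_jobs):
    while True:
        update_all(fetch_and_store_jobs)
        time.sleep(UPDATE_INTERVAL)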
Flags: needinfo?(jmaher)
The Talos alert manager doesn't do much with tinderbox logs; we source from graph server and hg for revisions.

This machine doesn't support a lot of load; we could increase the branches we pull from and keep updates at once a day as a starting position.
Flags: needinfo?(jmaher)
Ah, that makes sense. The main problem I was having is that mozilla-central is a low-activity repo and as such doesn't give a large enough sample size (particularly since I'm most interested in catching machines that have gone bad only within the last day or two).

If we could also monitor {mozilla-inbound,b2g-inbound,fx-team} that would be great - or if that's problematic, perhaps switching from mozilla-central to mozilla-inbound would be an improvement for the short term at least?

Thanks! :-)
Let's keep in mind the reason we use m-c for android status - it is the most stable branch, so we wouldn't have all the noise of backouts, etc.

This is a great idea to get a larger picture view, but for the weekly android reports, we should keep it pinned at m-c.
(In reply to Joel Maher (:jmaher) from comment #4)
> Let's keep in mind the reason we use m-c for android status - it is the most
> stable branch, so we wouldn't have all the noise of backouts, etc.
> 
> This is a great idea to get a larger picture view, but for the weekly
> android reports, we should keep it pinned at m-c.

It looks like the failures query specifies branch = 'mozilla-central', but this is something we'll have to be careful of in the future.
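
For illustration, here is a minimal sketch of how the branch filter could be parameterised so the slave_failures view covers all monitored branches while the weekly android report stays pinned to mozilla-central - the table and column names here are assumptions, not the actual Ouija schema:

# Sketch only: "testjobs", "slave", "result" and "branch" are assumed names.
def failures_query(branches=("mozilla-central",)):
    """Return (sql, params) restricted to the given branches."""
    placeholders = ", ".join(["%s"] * len(branches))
    sql = ("SELECT slave, result, count(*) AS jobs "
           "FROM testjobs "
           "WHERE branch IN ({}) "
           "GROUP BY slave, result").format(placeholders)
    return sql, list(branches)

# slave_failures view: all monitored branches.
sql, params = failures_query(("mozilla-central", "mozilla-inbound",
                              "b2g-inbound", "fx-team"))

# Weekly android report: keep it pinned to m-c.
sql_mc, params_mc = failures_query()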
Assignee: nobody → dminor
Status: NEW → ASSIGNED
Pull request here: https://github.com/dminor/ouija/pull/10
Ed, if all went according to plan we should now have a week's worth of data across the four branches. Do you mind taking a quick look and seeing if this gives enough coverage for you to be able to identify bad devices?

Sorry, I realize it's very slow at the moment now that the extra data has been added. If this looks like it could be useful to you, I'll try a small or medium AWS instance and see if that fixes the performance.

Thanks!
Flags: needinfo?(emorley)
The performance was ok TBH, and using it this morning I found 3 machines that were bad and needed disabling, so I think this is good for now. Thank you! :-)
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(emorley)
Resolution: --- → FIXED
(In reply to Ed Morley [:edmorley UTC+0] from comment #8)
> The performance was ok TBH, and using it this morning I found 3 machines
> that were bad and needed disabling, so I think this is good for now. Thank
> you! :-)

Cool, glad to hear it!