remove slaves with hardware problems from the pool

RESOLVED FIXED

Status

task
RESOLVED FIXED
7 years ago
a year ago

People

(Reporter: dbaron, Unassigned)

Attachments

(9 attachments, 1 obsolete attachment)

After bug 785724 I started to be suspicious of a few other Snow Leopard slaves for causing memory corruption bugs -- in particular, talos-r4-snow-013 and -041.  (See, e.g., bug 785724 comment 2.)

I decided, however, that finding slaves that have bad memory or disk ought to be something we can automate rather than do manually.

So I wrote a python script to look at the distribution of tinderbox failures across slaves.  In particular, for a set of slaves with a particular naming scheme, it will look for all of the orange bugs with that naming scheme occurring in comments, and report:
 * the number of *open* orange bugs that occurred on that slave
 * the number of *fixed* orange bugs that occurred on that slave (a proxy for how long the slave has been in use and how many failures there are on its platform -- i.e., how many real bugs it found)
 * the list of open orange bugs that occurred *only* on that slave

The idea is that slaves with hardware problems are likely to have a higher open/fixed ratio, and are likely to have a larger set of unique open bugs.
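A minimal sketch of the three numbers described above, assuming the Bugzilla data has already been fetched into a dict. This is not the attached script: the `summarize` function, the input shape, the bug numbers, and the slave name talos-r4-snow-002 are all made up for illustration.

```python
from collections import defaultdict

def summarize(bugs):
    """bugs maps bug id -> (status, set of slave names found in its comments).

    status is "open" or "fixed". Both the shape and the names are assumptions.
    """
    open_count = defaultdict(int)
    fixed_count = defaultdict(int)
    unique_open = defaultdict(list)
    for bug_id, (status, slaves) in sorted(bugs.items()):
        for slave in slaves:
            if status == "open":
                open_count[slave] += 1
            elif status == "fixed":
                fixed_count[slave] += 1
        # An open bug seen on exactly one slave counts as unique to it.
        if status == "open" and len(slaves) == 1:
            (only,) = slaves
            unique_open[only].append(bug_id)
    return {
        slave: {
            "open": open_count[slave],
            "fixed": fixed_count[slave],
            "unique_open": unique_open[slave],
        }
        for slave in set(open_count) | set(fixed_count)
    }

# Invented data, only to show the shape of the report.
example = {
    1001: ("open", {"talos-r4-snow-013"}),
    1002: ("open", {"talos-r4-snow-013"}),
    1003: ("open", {"talos-r4-snow-013", "talos-r4-snow-002"}),
    1004: ("fixed", {"talos-r4-snow-002", "talos-r4-snow-013"}),
    1005: ("fixed", {"talos-r4-snow-002"}),
}
report = summarize(example)
for slave in sorted(report):
    print(slave, report[slave])
```

In this invented data, -013 has a high open/fixed ratio and two unique open bugs, which is the pattern the script is meant to surface.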


I think a few problematic test slaves are probably the source of a bunch (though not a huge portion) of our intermittent orange, and we should eliminate those machines from the pool -- particularly if the unique reports that they're giving seem related to memory or disk corruption.

(I'd note that the list of unique bugs isn't necessarily the list of everything that's problematic -- I saw a bug that has been seen on only -013 and -041, but nowhere else.  So it won't show up as unique to any slave with this script.)
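One possible way to catch that case, sketched under the same assumed input shape (bug id -> (status, set of slaves)): once a few slaves are already under suspicion, look for open bugs seen on more than one slave where every slave involved is a suspect. The function name and the data are hypothetical, and this is not something the script does.

```python
def open_bugs_confined_to(bugs, suspects):
    """Return ids of open bugs seen on 2+ slaves, all of which are suspects."""
    suspects = set(suspects)
    return sorted(
        bug_id
        for bug_id, (status, slaves) in bugs.items()
        if status == "open" and len(slaves) > 1 and slaves <= suspects
    )

# Invented data: 2001 was seen only on the two suspect machines.
bugs = {
    2001: ("open", {"talos-r4-snow-013", "talos-r4-snow-041"}),
    2002: ("open", {"talos-r4-snow-013"}),
    2003: ("open", {"talos-r4-snow-013", "talos-r4-snow-002"}),
    2004: ("fixed", {"talos-r4-snow-013", "talos-r4-snow-041"}),
}
shared = open_bugs_confined_to(bugs, ["talos-r4-snow-013", "talos-r4-snow-041"])
print(shared)
```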


I don't have time to take this research to completion right now, but I've got some useful data already.  I'm curious what philor and edmorley think of the data here (which I'll attach shortly).
Posted file look-for-bad-slaves.py (obsolete)
To use:
 * uncomment one of the lines near the bottom
 * run it
I think this confirms my suspicions about 013 and 041.  Some other machines might also be cast under slight doubt, but not nearly as much.

This is before I added the sort to the list of unique bugs, but I didn't want to spend the bugzilla query time to rerun it.
I think all of these machines are healthy.
Comment on attachment 657100
look-for-bad-slaves.py

Rather than constantly update this attachment on a bug, I stuck this in a user repository.  (It's already improved over this attachment.)

https://hg.mozilla.org/users/dbaron_mozilla.com/bad-slaves/
Attachment #657100 - Attachment is obsolete: true
I think all of these look healthy as well.
I think there's something with 069, and maybe also 031.
038 seems awfully suspicious
Recognizing the bug 743465 bustedness of talos-r4-lion-069 is a pretty fair test of a bad-slave finder; you can see from the bug how many times I've struggled to identify it as broken.

This feels like something http://brasstacks.mozilla.com/orangefactor/ ought to be doing, if not for the fact that it refuses to believe in things that were typed into tbpl's comment box, so it knows a great deal about which slaves have hit which bugs except for its huge blind spots.
A few that are a bit suspicious, but nothing clearly bad, I think.
Posted file output for tegra-
tegra-159 looks pretty suspicious
I think the summary list of machines that ought to be removed from the pool promptly if they're not already is (along with existing bugs mentioning those machines in the summary):

talos-r4-snow-013
talos-r4-snow-014 (bug 779332) (ALREADY DISABLED)
talos-r4-snow-041 (bug 752185, bug 786006) (ALREADY DISABLED)
talos-r4-lion-069 (bug 706492, bug 713883, bug 723282, bug 743465, bug 770525, bug 786005)
talos-r3-w7-038 (bug 768512, bug 776924)
tegra-159 (bug 747694)

unless, that is, there are problems known to have been fixed since the bugs shown in the data were observed.
And I think we ought to go through and mark the [orange] bugs caused by bad hardware as invalid to avoid having people spending time on them.
(In reply to David Baron [:dbaron] from comment #11)
> I think the summary list of machines that ought to be removed from the pool
> promptly if they're not already is (along with existing bugs mentioning
> those machines in the summary):
> 
> talos-r4-snow-013
> talos-r4-snow-014 (bug 779332) (ALREADY DISABLED)
> talos-r4-snow-041 (bug 752185, bug 786006) (ALREADY DISABLED)
> talos-r4-lion-069 (bug 706492, bug 713883, bug 723282, bug 743465, bug
> 770525, bug 786005)
> talos-r3-w7-038 (bug 768512, bug 776924)
> tegra-159 (bug 747694)

Oranges that have occurred only on these six machines account for about 6% of our total open orange bugs.
(In reply to David Baron [:dbaron] from comment #13)
> Oranges that have occurred only on these six machines account for about 6%
> of our total open orange bugs.

Wow, thank you for looking at this!

OrangeFactor does have a 'by machine' view (which I've tried to use occasionally), but it's slightly lacking in its current form. Given this bug, it seems like improving it would be a worthwhile investment of time.
I should have run this earlier; talos-r3-leopard-048 also looks pretty bad.
I was thinking about this and I realized that it's a good opportunity to try to figure out how many of our crashes are related to bad hardware. Let's say we have the following numbers:

C = # crashes that we see on crashstats
C_g = # crashes that appear GC-related based on signature
C_h = # crashes that are caused by hardware errors
C_gh = # crashes that appear GC-related and are caused by hardware

Then we can make some rough estimates, guessing that crashes on tinderbox occur with similar frequency to crashes in the wild:

C_gh / C_h:
I looked at the bugs that dbaron closed and I filtered out the non-crashers. There were 80 crashers. Of those, 41 had signatures that would naturally cause us to blame the GC. Extrapolating to crashes in the wild, we can say C_gh/C_h ~= 41/80 = 51%.

C_g / C:
In bug 719114 comment 23, Scoobidiver came up with an estimate for how many of our crashes are caused by GC. It varied from 3% to 5%, so I estimated that C_g/C ~= 4%.

C_h / C:
Estimating how many crashes are caused by hardware failures is pretty hard. I guessed 6%, because that's the percentage of open oranges accounted for by the broken slaves. It would be interesting to see if there's data about this from other companies.

Anyway, now we can do the following:

C_gh / C_g
  = <probability that a crash with a GC-related crash signature is caused by hardware>
  = (C_gh / C_h) * (C_h / C) * (C / C_g)
  = (C_gh / C_h) * (C_h / C) / (C_g / C)
  = 0.51 * 0.06 / 0.04
  ~= 0.77

This kinda suggests that roughly 77% of the GC crashes in crashstats are caused by bad memory. I think the main source of error here is the idea that tinderbox crashes are statistically related to crashstats crashes.
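The arithmetic is easy to rerun with different inputs. A small sketch, using only the numbers quoted in this comment (41 of 80 crashers, 6% hardware, 3% to 5% GC); the function and variable names are mine:

```python
def gc_crashes_from_hardware(c_gh_over_c_h, c_h_over_c, c_g_over_c):
    """C_gh / C_g = (C_gh / C_h) * (C_h / C) / (C_g / C)."""
    return c_gh_over_c_h * c_h_over_c / c_g_over_c

c_gh_over_c_h = 41 / 80   # GC-looking crashers among the closed bugs
c_h_over_c = 0.06         # guessed share of crashes caused by hardware

# Sweep the 3%..5% range given for the GC share of all crashes.
estimates = {
    c_g_over_c: gc_crashes_from_hardware(c_gh_over_c_h, c_h_over_c, c_g_over_c)
    for c_g_over_c in (0.03, 0.04, 0.05)
}
for c_g_over_c, estimate in estimates.items():
    print(f"C_g/C = {c_g_over_c:.2f} -> C_gh/C_g = {estimate:.3f}")
```

At the low end of that range the ratio comes out above 1, which can't be a probability, so the inputs are at best loosely consistent with each other.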

I wonder if there's any way that we could run some hardware checks on a user's computer while Firefox is running. Then we could tell them that there's a problem. Probably the first step would be to figure out what's wrong with these slaves and come up with a way to detect it in software (with low overhead). It isn't totally obvious what's even going on here. I ran a memory tester on talos-r3-w7-038 a while back and found no problems. Since then it was put back in the pool and two more oranges were filed on it exclusively.
I'm pretty skeptical of comment 19, because the crashes in our test farm are in a very controlled environment, with very similar machines and no varying external software -- the exact environment in which we're going to crash least because we test it so well.  In other words, I'd expect C_h / C to be much higher in our very controlled test environment (where you estimate it as 6%) than it is in the real world (which has a lot more crashes that we don't see in our very controlled environment).
(In reply to David Baron [:dbaron] from comment #11)
> I think the summary list of machines that ought to be removed from the pool
> promptly if they're not already is (along with existing bugs mentioning
> those machines in the summary):
> 
> talos-r4-snow-013

Done, bug 787844.

> talos-r4-snow-014 (bug 779332) (ALREADY DISABLED)

Re-disabled, bug 779332.

> talos-r4-snow-041 (bug 752185, bug 786006) (ALREADY DISABLED)

Still disabled.

> talos-r4-lion-069 (bug 706492, bug 713883, bug 723282, bug 743465, bug
> 770525, bug 786005)

Disabled now, bug 743465.

> talos-r3-w7-038 (bug 768512, bug 776924)

Disabled now, bug 768512

> tegra-159 (bug 747694)

I'm not sure about this one, most of the unique bugs it's in are from 2012-08-06 with no repeats, so I've left it enabled.
(In reply to David Baron [:dbaron] from comment #18)
> Created attachment 657486
> output for talos-r3-leopard
> 
> I should have run this earlier; talos-r3-leopard-048 also looks pretty bad.

This one recently became talos-r3-w7-081 in bug 786037. Are you OK with giving it a go there, or do you prefer to pre-emptively disable it?
> I wonder if there's any way that we could run some hardware checks on a
> user's computer while Firefox is running. Then we could tell them that
> there's a problem.

I would suggest running a full memory correctness scan on start-up, but I fear Taras would have to take sick leave to recover from the cuts on his hands after he punches in his monitor.
(In reply to Nicholas Nethercote [:njn] from comment #23)
> > I wonder if there's any way that we could run some hardware checks on a
> > user's computer while Firefox is running. Then we could tell them that
> > there's a problem.
> 
> I would suggest running a full memory correctness scan on start-up, but I
> fear Taras would have to take sick leave to recover from the cuts on his
> hands after he punches in his monitor.

I don't think a low-overhead memory-robustness check exists. However, we can use proxies such as checking system temperature, attempting to detect how often software is crashing on the machine, etc. Reporting temperatures in breakpad might be fairly straightforward. Anecdotal evidence: crappy consumer laptops tend to overheat a lot after a few years, which leads to a lot of crashes.

Nick, rather than creating a [Snappy] painpoint, I propose we take the hit on the [MemShrink] side à la https://www.cs.princeton.edu/~appel/papers/memerr.pdf where we fill memory with garbage objects and iterate through them checking if they changed :)
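A toy sketch of the fill-and-recheck idea, only to make it concrete. It is not the paper's method and not a real memory tester: Python has no control over which physical pages back the buffer, and CPU caches can hide errors, so a real check would need native code. All names and the fill byte are assumptions.

```python
PATTERN = 0xA5  # arbitrary fill byte, chosen for this sketch

def make_canary(size, pattern=PATTERN):
    """Allocate `size` bytes and fill them with a known pattern."""
    return bytearray([pattern]) * size

def find_corruption(buf, pattern=PATTERN):
    """Return the offsets whose contents no longer match the pattern."""
    if buf.count(pattern) == len(buf):  # fast path: nothing changed
        return []
    return [i for i, byte in enumerate(buf) if byte != pattern]

canary = make_canary(1 << 20)       # 1 MiB of pattern bytes
clean = find_corruption(canary)     # nothing has touched the buffer yet

canary[12345] ^= 0x04               # simulate a single flipped bit
flipped = find_corruption(canary)
print(clean, flipped)
```

The cost is the memory held by the canary plus a periodic scan, which is the [MemShrink]-side hit mentioned above.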
You'd need temperature *and* tripping points, provided they're correct.
Back in 1999 we had a form reply to people who reported that GCC crashed with a segmentation fault while compiling the Linux kernel: "Type 'make' again.  Does it crash in the _same place_?  If not, you have bad RAM."
The problem is that we have no data about the crashiness of individual machines over time, as far as I'm aware, and we can't really gather that server side due to privacy concerns. I wonder if you could have a computer track how many GC-related crashes it had, and report that somehow.
Can we move the discussion of detecting users with bad RAM to a different bug?
(In reply to David Baron [:dbaron] from comment #28)
> Can we move the discussion of detecting users with bad RAM to a different
> bug?

I filed bug 788686 for this.
What's the end state of this particular bug? Should I be filing decommission bugs for these slaves? I think not even the r4 slaves are covered by warranty/AppleCare any longer.
I just want to point out that jobs are *not* distributed evenly between test slaves. Jobs are run on the most recent slave available to have run that same job type. See bug 714313.
It's a little hard to figure out what is left to be done on this bug.

I see the following slave taking jobs:
* talos-r4-snow-013

I see the following slaves *not* taking jobs, although they have been diagnosed clean:
* talos-r4-snow-014
* talos-r4-snow-041
* talos-r4-lion-069
* talos-r3-w7-038

dbaron, would you mind running your script again? Or showing us how to run it on our side?

What do we do about the slaves that were not diagnosed with any issues?
Can we put them back in the pool and run dbaron's tool after a week and see if they are repeat offenders?
Can we ship one of those machines to someone to figure out different hardware diagnostic tools, or how to reproduce the issue?

I would like to drive this to a resolution.

#########################################
[1]
Diagnostics were run for these slaves:
https://bugzilla.mozilla.org/show_bug.cgi?id=794926#c1
(In reply to Vinh Hua [:vinh] from comment #1)
> * talos-r4-snow-013 - Memory replaced and no errors were found afterwards.
> * talos-r4-snow-014 - No issues found after diagnostics test.
> * talos-r4-snow-041 - No issues found after diagnostics test.
> * talos-r4-lion-069 - No issues found after diagnostics test.
> 
> * talos-r3-w7-038   - Will need to find appropriate software to run
> diagnostics on this one since it's running Windows OS. (opening a different
> ticket = Bug 800051)
> 
> All working slaves have been placed back into Sonnet chassis and powered on.
A new round of output, which I haven't looked into yet.
Not quite sure what we can do to remove the noise from that report - being the only slave of a particular platform to hit an orange can be a sign of brokenness (generally by meaning that someone misstarred something), but it's horribly noisy (quite often also by meaning that someone misstarred something). 

This also depends on us always filing every single one-off failure, which we certainly don't, and on us not having things like "every single b2g marionette-webapi test can time out, so we'll eventually have a separate bug for each one with only one or two instances in it" and "every single b2g reftest can time out, and some people don't know there's a single bug for them all" and "every single b2g crashtest can..." and "every single b2g mochitest can...".

Despite the 8 unique, I think talos-r3-fed-020 is fine; despite only having 3 unique, I think talos-r4-lion-041 may well prove to be busted.
> Despite the 8 unique, I think talos-r3-fed-020 is fine; despite only having
> 3 unique, I think talos-r4-lion-041 may well prove to be busted.

Sounds like automated collation and (expert) human filtering is the required combination for this to be useful.  That's not as nice as a purely automated system, but it's better than nothing.
Yeah... can the automated collation be done at regular intervals (I don't know what interval to suggest) and dashboarded or something?
Product: mozilla.org → Release Engineering
The original intent of this bug was accomplished by nthomas.
Please file adequate bugs either for Releng or other teams on those suggested ideas.
There's nothing left on this specific bug to get done.

####################

It seems like the discussion about detecting users with bad RAM was moved to bug 788686

dbaron, can you please show us how to run your script? If you could attach it to the bug perhaps someone could grab it and work on it further.

> Sounds like automated collation and (expert) human filtering is the required
> combination for this to be useful.  That's not as nice as a purely automated
> system, but it's better than nothing.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
The script is kept in the repository at https://hg.mozilla.org/users/dbaron_mozilla.com/bad-slaves/ .  It needs manual updates (at the bottom of the script) to reflect new sets of slaves added to the pool.

To use it, run it (and redirect the standard output to a file) and then look for anomalies in the output.

Updated

a year ago
Product: Release Engineering → Infrastructure & Operations