Bug 824754 (t-snow-r4-0044): t-snow-r4-0044 problem tracking
Closed. Opened 12 years ago; closed 11 years ago.
Categories: Infrastructure & Operations Graveyard :: CIDuty (task, P3)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: u429623; Unassigned
References
Whiteboard: [buildduty][buildslaves][capacity][badslave]
Attachments (2 files, 1 obsolete file)
2.66 KB, patch: armenzg: review+; coop: checked-in+
2.09 KB, patch: bhearsum: review+; armenzg: checked-in+
talos-r4-snow-046 is reported as "seeing a rash of mysterious crashes in debug tests" - see bug 824498 for details
Please run diagnostics to see if there are any hardware issues, and resolve as needed.
Regardless of hardware issues, please reimage the host before returning it to releng.
Summary: talos-r4-snow-046 → talos-r4-snow-046 showing inexplicable crashes, hardware suspected
Updated•12 years ago
colo-trip: --- → scl1
Comment 1•12 years ago
I did a regular reboot on this host and it came up with no problems; I didn't read the diagnostics part. Will get to this on Monday.
Updated•12 years ago
Assignee: server-ops-dcops → nobody
Component: Server Operations: DCOps → Release Engineering: Machine Management
QA Contact: dmoore → armenzg
Updated•12 years ago
Summary: talos-r4-snow-046 showing inexplicable crashes, hardware suspected → talos-r4-snow-046 problem tracking
Comment 2•12 years ago
This is now re-enabled in prod
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 3•12 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=19457159&tree=Mozilla-Inbound is exactly the same sort of mysterious crash as bug 824498
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•12 years ago
Whiteboard: [buildduty][buildslaves][capacity] → [buildduty][buildslaves][capacity][badslave]
Comment 4•12 years ago
Disabled in slavealloc again.
coop: any ideas on what our next step is?
Flags: needinfo?(coop)
Comment 5•12 years ago
(In reply to Justin Wood (:Callek) from comment #4)
> disabled in slavealloc again.
>
> coop ideas on what our next step is?
This is the point where we usually need to replace the logic board. Please open an IT bug with dcops to start that process. Bonus points if you can batch it with other slaves that need the same attention.
Flags: needinfo?(coop)
Comment 6•12 years ago
Slave has been repaired, reimaged, and is back in service.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 7•12 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 8•12 years ago
Comment 9•12 years ago
Comment 10•12 years ago
See also bug 824498 - it looks like we're still getting intermittent crashes on this machine.
Comment 11•12 years ago
Comment 12•12 years ago
Comment 13•12 years ago
Comment 14•12 years ago
I'm moving this slave to staging permanently and will mark it as such in slavealloc.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 15•12 years ago
Swapping 1-for-1 slaves between staging and production. See previous comments in this bug for reasons why snow-046 is unsuitable for production.
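For context, the attached patch conceptually just moves hostnames between the Python slave lists that buildbot-configs uses for the production and staging pools. The sketch below is only an illustration of that kind of change; the file names and dictionary layout are assumptions, not the actual contents of the patch:

    # Hypothetical excerpt, not the real buildbot-configs files.
    # production_config.py (assumed name): slaves allowed to take production jobs
    SLAVES = {
        'snowleopard': [
            'talos-r4-snow-010',    # promoted from staging to production
            # 'talos-r4-snow-046' removed: unsuitable for production (this bug)
        ],
    }

    # staging_config.py (assumed name): slaves reserved for staging
    SLAVES = {
        'snowleopard': [
            'talos-r4-snow-046',    # kept in staging for further diagnosis
        ],
    }

The 1-for-1 swap keeps the production pool at the same capacity while the suspect machine stays out of developers' way.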
Assignee: nobody → coop
Status: RESOLVED → REOPENED
Attachment #720729 - Flags: review?(armenzg)
Resolution: FIXED → ---
Updated•12 years ago
Attachment #720729 - Flags: review?(armenzg) → review+
Comment 16•12 years ago
Can you please check whether snow-010 exists in the graphs production DB?
Comment 17•12 years ago
Comment on attachment 720729 [details] [diff] [review]
Move snow-010 to production and snow-046 to staging
Review of attachment 720729 [details] [diff] [review]:
-----------------------------------------------------------------
https://hg.mozilla.org/build/buildbot-configs/rev/8f4dfb408cd0
Attachment #720729 - Flags: checked-in+
Comment 18•12 years ago
I've rebooted both slaves into their respective pools.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 19•12 years ago
Merged and reconfiguration completed.
Updated•12 years ago
Product: mozilla.org → Release Engineering
Comment 20•11 years ago
Not responding to PDU reboots.
Updated•11 years ago
Assignee: coop → nobody
Comment 21•11 years ago
Back in production after an HD replacement.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Comment 22•11 years ago
Disabled in slavealloc in the meantime.
Attachment #814148 - Flags: review?(bhearsum)
Updated•11 years ago
Attachment #814148 - Flags: review?(bhearsum) → review+
Comment 23•11 years ago
Comment on attachment 814148 [details] [diff] [review]
add snow-046 back to the production pool
https://hg.mozilla.org/build/buildbot-configs/rev/1952f6d18716
Attachment #814148 - Flags: checked-in+
Comment 24•11 years ago
test_stag_not_in_prod ... [FAIL]
===============================================================================
[FAIL]: test_slave_allocation.SlaveCheck.test_stag_not_in_prod
Traceback (most recent call last):
File "test/test_slave_allocation.py", line 33, in test_stag_not_in_prod
'declared as staging-only:\n%s' % '\n'.join(sorted(common_slaves))
twisted.trial.unittest.FailTest: Staging-only slaves should not be declared as production and vice versa. However, the following production slaves declared as staging-only:
talos-r4-snow-046
not equal:
a = set()
b = set(['talos-r4-snow-046'])
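For anyone reading along: test_stag_not_in_prod is essentially a set-intersection check between the production and staging-only slave lists, and the backed-out patch added snow-046 to production without also removing it from staging. A minimal sketch of that invariant, using illustrative data and helper names rather than the real buildbot-configs contents (and plain unittest rather than twisted.trial), would look something like this; with the overlapping sample data it reproduces the failure above:

    # Minimal sketch of the invariant; names and data are assumptions.
    import unittest

    # Assumed shape: {platform: [hostnames]} -- illustrative data only.
    PRODUCTION_SLAVES = {'snowleopard': ['talos-r4-snow-010', 'talos-r4-snow-046']}
    STAGING_ONLY_SLAVES = {'snowleopard': ['talos-r4-snow-046']}

    def all_hosts(config):
        # Flatten a {platform: [hostnames]} dict into a set of hostnames.
        return {host for hosts in config.values() for host in hosts}

    class SlaveCheck(unittest.TestCase):
        def test_stag_not_in_prod(self):
            common = all_hosts(PRODUCTION_SLAVES) & all_hosts(STAGING_ONLY_SLAVES)
            self.assertEqual(set(), common,
                'Staging-only slaves should not be declared as production and '
                'vice versa. However, the following production slaves declared '
                'as staging-only:\n%s' % '\n'.join(sorted(common)))

    if __name__ == '__main__':
        unittest.main()

The follow-up patch (attachment 814166) removes the host from the staging list as well, which makes the intersection empty and lets the check pass.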
Comment 25•11 years ago
Backed it out; here's the good patch, after testing it with test-masters.sh.
Attachment #814148 - Attachment is obsolete: true
Attachment #814166 - Flags: review?(bhearsum)
Updated•11 years ago
Attachment #814166 - Flags: review?(bhearsum) → review+
Comment 26•11 years ago
Comment on attachment 814166 [details] [diff] [review]
add snow-046 back to the production pool and remove it from staging
https://hg.mozilla.org/build/buildbot-configs/rev/1ec4a515adcf
Attachment #814166 - Flags: checked-in+
Updated•11 years ago
Assignee: nobody → armenzg
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 27•11 years ago
in production
Comment 28•11 years ago
Rebooted the machine into production and enabled it on slavealloc.
Assignee: armenzg → nobody
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Comment 29•11 years ago
I'll want to be able to find the GC crash in https://tbpl.mozilla.org/php/getParsedLog.php?id=28895921&tree=Try again.
Comment 30•11 years ago
And the GC crash in https://tbpl.mozilla.org/php/getParsedLog.php?id=28906391&tree=Try
Comment 31•11 years ago
What do you mean? Is the slave doing something unexpected?
I don't know much about "finding the GC crash".
Comment 32•11 years ago
Garbage collection and cycle collection apparently do a good job of exercising RAM, and so typically a machine with bad RAM (or a CPU that's bad about talking to its RAM, or a single trace with a hairline crack in it between the CPU and the RAM, or whatever it may really be) will, along with hitting PPoD failures in reftests, hit a lot of GC crashes.
That's what this slave did, and the reason we had multiple bugs filed about crashes in tests that only happened on this slave, and that's the reason it was in staging rather than production with a slavealloc note saying not to put it in production.
Disabled in slavealloc, please do not bring it back to production without diagnosing what's actually wrong with the memory, fixing it, and running at least a hundred test runs in staging without a single unexplained unexpected not-seen-on-other-slaves crash.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 34•11 years ago
philor: any ideas on whether we could word this for developers so they could try to find an easy test case?
(In reply to Phil Ringnalda (:philor) from comment #32)
> Garbage collection and cycle collection apparently do a good job of
> exercising RAM, and so typically a machine with bad RAM (or a CPU that's bad
> about talking to its RAM, or a single trace with a hairline crack in it
> between the CPU and the RAM, or whatever it may really be) will, along with
> hitting PPoD failures in reftests, hit a lot of GC crashes.
>
> That's what this slave did, and the reason we had multiple bugs filed about
> crashes in tests that only happened on this slave, and that's the reason it
> was in staging rather than production with a slavealloc note saying not to
> put it in production.
>
> Disabled in slavealloc, please do not bring it back to production without
> diagnosing what's actually wrong with the memory, fixing it, and running at
> least a hundred test runs in staging without a single unexplained unexpected
> not-seen-on-other-slaves crash.
Comment 35•11 years ago
Sorry, that was casual phrasing on my part. I have no reason to believe that GC/CC are *better* at detecting bad RAM than memtest86 (which has been under development for almost 20 years, focusing on just that one task), they are simply the most likely part of our tests to wind up crashing. Running tests for 24 hours with me looking at the results will indeed show intermittent memory failures, along with lots of noise which is not from intermittent memory failures, but it's not a better way.
I think it would be far better to focus, first, on running memtest86 on this slave once, since I think any diagnostics on it would have been run before we started using memtest86 rather than Apple's memory diagnostics, and second, on having a plan to run it long enough to detect intermittent failures that don't show up in just one quick run.
Comment 36•11 years ago
Thanks philor :)
Comment 37•11 years ago
Memory analysis requested in bug 933886.
Comment 38•11 years ago
2013-01-03 - "Ran hardware diagnostic three times but did not find any issues. All hardware passed."
2013-01-15 - "Host has been reimaged."
2013-02-05 - "mysterious crashes"
2013-02-15 - logic board replaced
2013-02-17 - same issues
2013-10-02 - back from HD replacement
2013-10-09 - reported more issues
2013-11-11 - memtest did not find any issues ("after running memtest86+ multiple times")
I can only see us replacing the memory and giving it one more shot. After that, we should decommission it.
Comment 39•11 years ago
RAM has been replaced.
Putting back into production.
I will check tomorrow.
Assignee: nobody → armenzg
Comment 40•11 years ago
It looks good.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
Assignee: armenzg → nobody
QA Contact: armenzg → bugspam.Callek
Alias: talos-r4-snow-046 → t-snow-r4-0044
Summary: talos-r4-snow-046 problem tracking → t-snow-r4-0044 problem tracking
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard