Closed
Bug 731304
(talos-r4-snow-030)
Opened 12 years ago
Closed 11 years ago
talos-r4-snow-030 problem tracking
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Unassigned)
References
Details
(Whiteboard: [buildduty][capacity][buildslaves][badslave?][decomm])
Attachments
(1 file)
17.29 KB, image/png
No description provided.
Reporter
Updated•12 years ago
Alias: talos-r4-snow-030
Summary: talos-r4-snow-030 needs rebooting → talos-r4-snow-030
Updated•12 years ago
Priority: -- → P3
Whiteboard: [buildduty][capacity][buildslaves]
Reporter
Updated•12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 1•12 years ago
[12:10] <nagios-sjc1> [72] talos-r4-snow-030.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
Comment 2•12 years ago
slave is taking jobs - closing, thanks RelOps!
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Summary: talos-r4-snow-030 → talos-r4-snow-030 problem tracking
Comment 3•12 years ago
Please reboot.
Comment 4•12 years ago
no response via ssh - rebooted via PDU
Comment 5•12 years ago
That got it back in production.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
No longer depends on: 760958
Resolution: --- → FIXED
Updated•12 years ago
Comment 6•12 years ago
Back at the coal face.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Updated•12 years ago
Component: Release Engineering → Release Engineering: Machine Management
QA Contact: armenzg
Comment 7•12 years ago
Tried to reboot via the PDU (pdu1.r102-1.build.scl1:BB7) but didn't work, either by Reboot or by Off then On. Please investigate.
Comment 8•12 years ago
Bad onboard NIC; confirmed the switch port and cable work fine. Will need to bring it to an Apple-certified tech to swap out the motherboard if possible. Bug 794184 opened to track.
Comment 9•12 years ago
Back in production.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 10•11 years ago
Disabled in slavealloc for bad talos results (Clint will post more shortly).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 11•11 years ago
So, we are looking at how the new datazilla compares to the graph server, and to see that, jeads put together this page: http://people.mozilla.com/~jeads/summary.html#

The new datazilla statistics approach calculates whether a talos test "passes" by comparing each page of the test's results to the historical data for that page; Datazilla then outputs the total number of "tests" that passed versus failed. Consolidating the percent of tests passed across time lets us generate a graph similar in appearance to the old graphs.m.o, which uses an outdated stats model.

If you look at our per-platform breakdown there, you will note a periodic failure happening on Mac 10.6 talos runs (see the deep valleys on the datazilla graphs and the high outliers on the corresponding graphs.m.o graphs). Comparing that to the same time sequence on graph server, you will see the same thing: several high outliers which correspond to these same changesets: http://graphs.mozilla.org/graph.html#tests=[[206,63,21]]&sel=none&displayrange=30&datatype=running

Here is a sample of the specific changesets:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=36a681f8f124
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=f01f7b2cd99a
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=782e3ab94db7
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=bc9d2b47cda8

Each of these used this machine, and all the runs on this machine are uniformly creating outliers in our tests. So this machine is either slower or has fewer resources than the other Mac OS X 10.6 boxes. Either way, we'd like to remove this machine from the talos pool for the time being, until we find out what is happening here.
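The percent-passing consolidation described above can be sketched roughly as follows. This is a hypothetical illustration, not Datazilla's actual code: the names (`page_passes`, `percent_passed`) and the 3-sigma pass rule are assumptions standing in for whatever historical comparison Datazilla really uses.

```python
from statistics import mean, stdev

def page_passes(value, history, threshold=3.0):
    """Hypothetical pass rule: a page 'passes' if its result falls within
    `threshold` standard deviations of that page's historical mean."""
    if len(history) < 2:
        return True  # not enough history to judge; assume passing
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value == mu
    return abs(value - mu) <= threshold * sigma

def percent_passed(results, history):
    """Consolidate per-page pass/fail outcomes into one percentage per push.

    `results` maps page name -> measured value for this push;
    `history` maps page name -> list of prior values for that page.
    """
    outcomes = [page_passes(v, history.get(page, [])) for page, v in results.items()]
    return 100.0 * sum(outcomes) / len(outcomes)

# One well-behaved page and one outlier page -> 50% of pages pass.
history = {"page_a": [100, 102, 98, 101], "page_b": [200, 199, 201, 200]}
print(percent_passed({"page_a": 101, "page_b": 350}, history))  # -> 50.0
```

Plotting that percentage per push over time is what produces the graph on the summary page; a machine that is uniformly slower drags its pages outside the historical band and shows up as the deep valleys described above.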
Updated•11 years ago
Whiteboard: [buildduty][capacity][buildslaves] → [buildduty][capacity][buildslaves][badslave?]
Comment 12•11 years ago
So, should we decomm this machine given comment #11?
Flags: needinfo?(hwine)
Comment 13•11 years ago
Pull from pools, but hold for parts and/or deeper diagnosis (i.e. in case we start running short of r4 machines prior to being able to decomm all of them).
Flags: needinfo?(hwine)
Reporter
Comment 14•11 years ago
I can confirm that there were some outliers (see attachment) during the time that this slave was running in January. We don't run the "Tp5 No Network Row Major MozAfterPaint" job anymore. Should we try putting the machine back into the pool? Is there a way to determine if this machine was giving trouble in other talos jobs?

Chronology of events:
* in comment #7 (23-09-2012) we asked for a reboot in bug 793221
* in comment #8 we determined that the machine has a bad NIC and needs AppleCare (bug 794184)
* on 2012-12-04 we requested that the dongles be checked in bug 814260 (since it came back from the repair with dongle problems)
* after 2 months we were asked to disable the slave
Updated•11 years ago
Whiteboard: [buildduty][capacity][buildslaves][badslave?] → [buildduty][capacity][buildslaves][badslave?][needs diagnostics]
Comment 15•11 years ago
(In reply to Hal Wine [:hwine] from comment #13)
> pull from pools, but hold for parts and/or deeper diagnosis. (i.e. in case
> we start running short of r4 machines prior to being able to decomm all of
> them.)

Given this, there's nothing left to do here. I'm going to leave the machine in buildbot-configs and such in case we figure out how to recover it. I'm closing this bug, though.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Comment 16•11 years ago
Seems to be the only slave that manages to timeout in test_prompt_async.html (see bug 870175 comment 7). And just looking at the overall slave health, looks like it's pretty flaky overall. Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering
Comment 17•11 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #16)
> Seems to be the only slave that manages to timeout in test_prompt_async.html
> (see bug 870175 comment 7). And just looking at the overall slave health,
> looks like it's pretty flaky overall.
>
> Disabled in slavealloc.

I don't know why this slave was put back into production based on comment #13 and onwards. No idea what to do with this slave.
Updated•11 years ago
Whiteboard: [buildduty][capacity][buildslaves][badslave?][needs diagnostics] → [buildduty][capacity][buildslaves][badslave?][decomm]
Reporter
Comment 18•11 years ago
Updated fields on slavealloc. Taking this bug to verify that our health reports don't show it anymore.
Assignee: nobody → armenzg
Reporter
Comment 19•11 years ago
It doesn't show up on the health reports anymore: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=talos-r4-snow
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Reporter
Updated•10 years ago
Assignee: armenzg → nobody
QA Contact: armenzg → bugspam.Callek
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard