Closed Bug 731304 (talos-r4-snow-030) Opened 12 years ago Closed 11 years ago

talos-r4-snow-030 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

References

Details

(Whiteboard: [buildduty][capacity][buildslaves][badslave?][decomm])

Attachments

(1 file)

      No description provided.
Alias: talos-r4-snow-030
Summary: talos-r4-snow-030 needs rebooting → talos-r4-snow-030
Priority: -- → P3
Whiteboard: [buildduty][capacity][buildslaves]
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
[12:10]  <nagios-sjc1> [72] talos-r4-snow-030.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
Status: RESOLVED → REOPENED
Depends on: 742433
No longer depends on: 731291
Resolution: FIXED → ---
No longer depends on: 742433
slave is taking jobs - closing, thanks RelOps!
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Summary: talos-r4-snow-030 → talos-r4-snow-030 problem tracking
Please reboot.
Status: RESOLVED → REOPENED
Depends on: 760958
Resolution: FIXED → ---
no response via ssh - rebooted via PDU
That got it back in production.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
No longer depends on: 760958
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Depends on: 781825
Resolution: FIXED → ---
Back at the coal face.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Component: Release Engineering → Release Engineering: Machine Management
QA Contact: armenzg
Tried to reboot via the PDU (pdu1.r102-1.build.scl1:BB7), but it didn't work, either by Reboot or by Off then On. Please investigate.
Status: RESOLVED → REOPENED
Depends on: 793221
Resolution: FIXED → ---
Bad onboard NIC; confirmed the switch port and cable work fine. Will need to bring it to an Apple-certified tech to swap out the motherboard if possible.


Bug 794184 opened to track.
Depends on: 794184
Depends on: 814260
Back in production.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Disabled in slavealloc for bad talos results (Clint will post more shortly).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
We are looking at how the new Datazilla compares to the graph server, and to see that comparison, jeads put together this page:
http://people.mozilla.com/~jeads/summary.html#

The new Datazilla statistics approach determines whether a talos test is "passing" by comparing each page of the test's results to the historical data for that page, and then outputs the total number of "tests" that passed versus failed. Consolidating the percentage of tests passed across time lets us generate a graph that looks similar to the old graphs.m.o, which uses an outdated stats model. If you look at our per-platform breakdown there, you will note a periodic failure happening on Mac OS X 10.6 talos runs (see the deep valleys on the Datazilla graphs and the high outliers on the corresponding graphs.m.o graphs).

Comparing that to the same time sequence on the graph server, you will see the same thing: several high outliers that correspond to these same changesets: http://graphs.mozilla.org/graph.html#tests=[[206,63,21]]&sel=none&displayrange=30&datatype=running

And if you look at the specific changesets, here is a sample:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=36a681f8f124
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=f01f7b2cd99a
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=782e3ab94db7
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=bc9d2b47cda8

You will see that each of these is using this machine, and all the runs on this machine are uniformly producing outliers in our tests. So this machine is either slower or has fewer resources than the other Mac OS X 10.6 boxes. Either way, we'd like to remove it from the talos pool for the time being, until we find out what is happening here.
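
To make the consolidation step concrete, here is a minimal sketch of the idea (the function names, data shapes, and the 3-sigma threshold are assumptions for illustration, not the actual Datazilla code):

from statistics import mean, stdev

def page_passes(value, history, max_sigma=3.0):
    # A page "passes" if its value falls within max_sigma standard deviations
    # of that page's historical mean (hypothetical threshold).
    if len(history) < 2:
        return True
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) <= max_sigma * sigma

def percent_passed(run, history_by_page):
    # run: {page_name: value} for one talos run on one changeset/slave.
    # history_by_page: {page_name: [previous values for that page]}.
    if not run:
        return 100.0
    passed = sum(page_passes(v, history_by_page.get(p, [])) for p, v in run.items())
    return 100.0 * passed / len(run)

# Plotting percent_passed per run across time gives a series comparable to the
# summary page above; deep valleys are runs where many pages fell outside
# their historical range.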
Whiteboard: [buildduty][capacity][buildslaves] → [buildduty][capacity][buildslaves][badslave?]
So, should we decomm this machine given comment #11?
Flags: needinfo?(hwine)
pull from pools, but hold for parts and/or deeper diagnosis. (i.e. in case we start running short of r4 machines prior to being able to decomm all of them.)
Flags: needinfo?(hwine)
Attached image screenshot
I can confirm that there were some outliers (see attachment) during the time that this slave was running in January.

We don't run the "Tp5 No Network Row Major MozAfterPaint" job anymore.
Should we try putting the machine back into the pool?
Is there a way to determine if this machine was giving trouble in other talos jobs? (A rough sketch of one way to check follows the chronology below.)

Chronology of events:
* in comment #7 (2012-09-23) we asked for a reboot in bug 793221
* in comment #8 we determined that the machine has a bad NIC and needs Apple care (bug 794184)
* on 2012-12-04 we requested that the dongles be checked in bug 814260 (since the machine came back from the repair with dongle problems).
* two months later we were asked to disable the slave.
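
As a purely hypothetical sketch of how one might check whether this slave was an outlier in other talos jobs (it assumes a CSV export with slave/test/value columns, which is not an existing RelEng tool):

import csv
from collections import defaultdict
from statistics import mean, stdev

def flag_outlier_slaves(csv_path, test_name, min_runs=5, max_sigma=2.0):
    # Group reported values by slave for one talos test.
    by_slave = defaultdict(list)
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            if row["test"] == test_name:
                by_slave[row["slave"]].append(float(row["value"]))
    pool = [v for values in by_slave.values() for v in values]
    if len(pool) < 2:
        return {}
    mu, sigma = mean(pool), stdev(pool)
    # Flag slaves whose mean sits well outside the pool-wide distribution.
    return {slave: mean(values)
            for slave, values in by_slave.items()
            if len(values) >= min_runs and abs(mean(values) - mu) > max_sigma * sigma}

# e.g. flag_outlier_slaves("talos_snow_export.csv", "tp5n") would list any slaves
# whose mean result differs markedly from the pool, which is roughly what
# comment #11 observed for this machine by eye.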
Whiteboard: [buildduty][capacity][buildslaves][badslave?] → [buildduty][capacity][buildslaves][badslave?][needs diagnostics]
Depends on: 885865
Depends on: 889457
(In reply to Hal Wine [:hwine] from comment #13)
> pull from pools, but hold for parts and/or deeper diagnosis. (i.e. in case
> we start running short of r4 machines prior to being able to decomm all of
> them.)

Given this, there's nothing left to do here. I'm going to leave the machine in buildbot-configs and such in case we figure out how to recover it. I'm closing this bug, though.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Seems to be the only slave that manages to timeout in test_prompt_async.html (see bug 870175 comment 7). And just looking at the overall slave health, looks like it's pretty flaky overall.

Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Product: mozilla.org → Release Engineering
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #16)
> Seems to be the only slave that manages to timeout in test_prompt_async.html
> (see bug 870175 comment 7). And just looking at the overall slave health,
> looks like it's pretty flaky overall.
> 
> Disabled in slavealloc.

I don't know why this slave was put back into production given comment #13 and onwards. No idea what to do with it.
Whiteboard: [buildduty][capacity][buildslaves][badslave?][needs diagnostics] → [buildduty][capacity][buildslaves][badslave?][decomm]
Depends on: 928102
Updated fields on slavealloc.

Taking to verify that our health reports don't show it anymore.
Assignee: nobody → armenzg
It doesn't show up on the health reports anymore:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=talos-r4-snow
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Assignee: armenzg → nobody
QA Contact: armenzg → bugspam.Callek
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard