Closed Bug 731304 (talos-r4-snow-030) Opened 12 years ago Closed 11 years ago

talos-r4-snow-030 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

References

Details

(Whiteboard: [buildduty][capacity][buildslaves][badslave?][decomm])

Attachments

(1 file)

      No description provided.
Alias: talos-r4-snow-030
Summary: talos-r4-snow-030 needs rebooting → talos-r4-snow-030
Priority: -- → P3
Whiteboard: [buildduty][capacity][buildslaves]
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
[12:10]  <nagios-sjc1> [72] talos-r4-snow-030.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
Status: RESOLVED → REOPENED
Depends on: 742433
No longer depends on: 731291
Resolution: FIXED → ---
No longer depends on: 742433
slave is taking jobs - closing, thanks RelOps!
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Summary: talos-r4-snow-030 → talos-r4-snow-030 problem tracking
Please reboot.
Status: RESOLVED → REOPENED
Depends on: 760958
Resolution: FIXED → ---
no response via ssh - rebooted via PDU
That got it back in production.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
No longer depends on: 760958
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Depends on: 781825
Resolution: FIXED → ---
Back at the coal face.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Component: Release Engineering → Release Engineering: Machine Management
QA Contact: armenzg
Tried to reboot via the PDU (pdu1.r102-1.build.scl1:BB7), but it didn't work, either by Reboot or by Off then On. Please investigate.
Status: RESOLVED → REOPENED
Depends on: 793221
Resolution: FIXED → ---
Bad onboard NIC; confirmed the switch port and cable work fine. Will need to bring it to an Apple-certified tech to swap out the motherboard if possible.


Bug 794184 opened to track.
Depends on: 794184
Depends on: 814260
Back in production.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Disabled in slavealloc for bad talos results (Clint will post more shortly).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
We are looking at how the new Datazilla compares to the graph server, and to see that comparison, jeads put together this page:
http://people.mozilla.com/~jeads/summary.html#

The new Datazilla statistics approach determines whether a talos test is "passing" by comparing each page of the test's results to the historical data for that page, and then outputs the total number of "tests" that passed versus failed. Consolidating the percentage of tests passed across time lets us generate a graph that looks similar to the old graphs.m.o, which uses an outdated stats model. If you look at our per-platform breakdown there, you will note a periodic failure happening on Mac OS X 10.6 talos runs (see the deep valleys on the Datazilla graphs and the high outliers on the corresponding graphs.m.o graphs).

Comparing that to the same time sequence on the graph server, you will see the same thing: several high outliers that correspond to these same changesets: http://graphs.mozilla.org/graph.html#tests=[[206,63,21]]&sel=none&displayrange=30&datatype=running

And if you look at the specific changesets, here is a sample:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=36a681f8f124
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=f01f7b2cd99a
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=782e3ab94db7
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&noignore=1&rev=bc9d2b47cda8

You will see that each of these is using this machine, and all the runs on this machine are uniformly producing outliers in our tests. So this machine is either slower or has fewer resources than the other Mac OS X 10.6 boxes. Either way, we'd like to remove it from the talos pool for the time being, until we find out what is happening here.
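
To make the consolidation step concrete, here is a minimal sketch of the idea (the function names, data shapes, and the 3-sigma threshold are assumptions for illustration, not the actual Datazilla code):

from statistics import mean, stdev

def page_passes(value, history, max_sigma=3.0):
    # A page "passes" if its value falls within max_sigma standard deviations
    # of that page's historical mean (hypothetical threshold).
    if len(history) < 2:
        return True
    mu, sigma = mean(history), stdev(history)
    return abs(value - mu) <= max_sigma * sigma

def percent_passed(run, history_by_page):
    # run: {page_name: value} for one talos run on one changeset/slave.
    # history_by_page: {page_name: [previous values for that page]}.
    if not run:
        return 100.0
    passed = sum(page_passes(v, history_by_page.get(p, [])) for p, v in run.items())
    return 100.0 * passed / len(run)

# Plotting percent_passed per run across time gives a series comparable to the
# summary page above; deep valleys are runs where many pages fell outside
# their historical range.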
Whiteboard: [buildduty][capacity][buildslaves] → [buildduty][capacity][buildslaves][badslave?]
So, should we decomm this machine given comment #11?
Flags: needinfo?(hwine)
pull from pools, but hold for parts and/or deeper diagnosis. (i.e. in case we start running short of r4 machines prior to being able to decomm all of them.)
Flags: needinfo?(hwine)
Attached image screenshot
I can confirm that there were some outliers (see attachment) during the time that this slave was running in January.

We don't run the "Tp5 No Network Row Major MozAfterPaint" job anymore.
Should we try putting the machine back into the pool?
Is there a way to determine if this machine was giving trouble in other talos jobs? (A rough sketch of one way to check follows the chronology below.)

Chronology of events:
* in comment #7 (2012-09-23) we asked for a reboot in bug 793221
* in comment #8 we determined that the machine has a bad NIC and needs Apple care (bug 794184)
* on 2012-12-04 we requested that the dongles be checked in bug 814260 (since the machine came back from the repair with dongle problems).
* two months later we were asked to disable the slave.
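
As a purely hypothetical sketch of how one might check whether this slave was an outlier in other talos jobs (it assumes a CSV export with slave/test/value columns, which is not an existing RelEng tool):

import csv
from collections import defaultdict
from statistics import mean, stdev

def flag_outlier_slaves(csv_path, test_name, min_runs=5, max_sigma=2.0):
    # Group reported values by slave for one talos test.
    by_slave = defaultdict(list)
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            if row["test"] == test_name:
                by_slave[row["slave"]].append(float(row["value"]))
    pool = [v for values in by_slave.values() for v in values]
    if len(pool) < 2:
        return {}
    mu, sigma = mean(pool), stdev(pool)
    # Flag slaves whose mean sits well outside the pool-wide distribution.
    return {slave: mean(values)
            for slave, values in by_slave.items()
            if len(values) >= min_runs and abs(mean(values) - mu) > max_sigma * sigma}

# e.g. flag_outlier_slaves("talos_snow_export.csv", "tp5n") would list any slaves
# whose mean result differs markedly from the pool, which is roughly what
# comment #11 observed for this machine by eye.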
Whiteboard: [buildduty][capacity][buildslaves][badslave?] → [buildduty][capacity][buildslaves][badslave?][needs diagnostics]
Depends on: 885865
Depends on: 889457
(In reply to Hal Wine [:hwine] from comment #13)
> pull from pools, but hold for parts and/or deeper diagnosis. (i.e. in case
> we start running short of r4 machines prior to being able to decomm all of
> them.)

Given this, there's nothing left to do here. I'm going to leave the machine in buildbot-configs and such in case we figure out how to recover it. I'm closing this bug, though.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Seems to be the only slave that manages to timeout in test_prompt_async.html (see bug 870175 comment 7). And just looking at the overall slave health, looks like it's pretty flaky overall.

Disabled in slavealloc.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Product: mozilla.org → Release Engineering
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #16)
> Seems to be the only slave that manages to timeout in test_prompt_async.html
> (see bug 870175 comment 7). And just looking at the overall slave health,
> looks like it's pretty flaky overall.
> 
> Disabled in slavealloc.

I don't know why this slave was put back into production given comment #13 and onwards. No idea what to do with it.
Whiteboard: [buildduty][capacity][buildslaves][badslave?][needs diagnostics] → [buildduty][capacity][buildslaves][badslave?][decomm]
Depends on: 928102
Updated fields on slavealloc.

Taking to verify that our health reports don't show it anymore.
Assignee: nobody → armenzg
It doesn't show up on the health reports anymore:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=talos-r4-snow
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Assignee: armenzg → nobody
QA Contact: armenzg → bugspam.Callek
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard