Closed Bug 1464064 Opened 7 years ago Closed 6 years ago
Moonshot Linux nodes stop functioning
Categories: Infrastructure & Operations :: RelOps: General (task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: arny, Assigned: dhouse)
We have found that the below Linux servers are not visible in TC and are not taking jobs. We will re-image them and update this bug.
t-linux64-ms-193
t-linux64-ms-279
t-linux64-ms-280
t-linux64-ms-484
t-linux64-ms-495
t-linux64-ms-527
t-linux64-ms-580
Comment 1•7 years ago
Can we get a link to the papertrail logs for some of these? This sounds suspiciously like what we're seeing on the w10 nodes, where they suddenly stop working.
Updated•7 years ago
Blocks: t-linux64-ms-280
Good results for t-linux64-ms-495 (https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-495). I think we need to do the same reimage, then watch and possibly cold-reboot if the first boot after reimage gets stuck.
I'll check 279 and 280 next (they get stuck at the pxeboot menu).
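For reference, a cold reboot of a cartridge that hangs on its first boot after reimage can be issued out-of-band. This is only a sketch using standard IPMI commands; the management address and credentials below are placeholders, and the actual workflow here may go through the chassis manager instead.
```
# Hypothetical out-of-band power cycle for a node stuck on first boot after reimage.
# BMC_HOST, IPMI_USER and IPMI_PASS are placeholders, not values from this bug.
BMC_HOST="node-management-address.example"
ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power status
ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power cycle
```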
(In reply to Kendall Libby [:fubar] from comment #1)
> Can we get a link to the papertrail logs for some of these? This sounds
> suspiciously like what we're seeing on the w10 nodes, where they suddenly
> stop working.
Here is all of them in papertrail: https://papertrailapp.com/groups/6937292/events?q=t-linux64-ms-193%20OR%20%20t-linux64-ms-279%20OR%20%20t-linux64-ms-280%20OR%20%20t-linux64-ms-484%20OR%20%20t-linux64-ms-495%20OR%20%20t-linux64-ms-527%20OR%20%20t-linux64-ms-580&focus=936334690644828165
Not much for the ones that are stuck at pxeboot, however. But ones like #495 show logs once they are repaired: https://papertrailapp.com/systems/1899813261/events?focus=936334699754856501
There is a tracking bug for hardware issues on the moonshots:
https://bugzilla.mozilla.org/show_bug.cgi?id=1428159
If we find a hardware issue with any of these we can address it through that bug.
See Also: → 1428159
stuck on pxeboot:
t-linux64-ms-193 (after failing pxeboot, goes into xen currently)
t-linux64-ms-279 (after failing pxeboot, goes into ubuntu but without tc-worker running)
t-linux64-ms-280 (after failing pxeboot, goes into ubuntu but without tc-worker running)
fixed by reimage:
t-linux64-ms-484
t-linux64-ms-495
t-linux64-ms-527
Okay: 571-580 are not in production; they are a development set.
t-linux64-ms-580 (we expect this to be off or not running tc-worker)
t-linux64-ms-488 also came up as not running the tc worker. This one needs to be reimaged as it has a problem with its puppet certificate and cannot update its puppet config.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-488
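For what it's worth, the usual alternative to a reimage for a broken agent certificate on Puppet 3.x would be to revoke and re-issue it. This is only a sketch: the master hostname is not given in this bug, the FQDN assumes the same naming pattern as the other mdc2 nodes, and reimaging may well be the simpler path here.
```
# Hypothetical cert re-issue instead of a reimage (Puppet 3.x paths).
# On the puppet master (hostname not specified in this bug):
puppet cert clean t-linux64-ms-488.test.releng.mdc2.mozilla.com

# On the agent: discard the old SSL data and request a fresh certificate.
rm -rf /var/lib/puppet/ssl
puppet agent --test --waitforcert 60
```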
Comment 7•7 years ago
I've done the reimage on t-linux64-ms-488. It seems like it appears in TC and is taking tasks. We still cannot SSH into it.
Reporter
Comment 8•7 years ago
The below Linux servers were not present in the TC list; however, I was able to see each one's tasks. I have rebooted all of them and they run tasks successfully.
t-linux64-ms-007
t-linux64-ms-057
t-linux64-ms-141
t-linux64-ms-183
t-linux64-ms-189
t-linux64-ms-493
Reporter
Comment 9•7 years ago
(In reply to Attila Craciun [:arny] from comment #8)
> The below Linux servers were not present in the TC list; however, I was
> able to see each one's tasks. I have rebooted all of them and they run tasks
> successfully.
>
> t-linux64-ms-007
> t-linux64-ms-057
> t-linux64-ms-141
> t-linux64-ms-183
> t-linux64-ms-189
> t-linux64-ms-493
PXE is also not working for these machines.
Reporter
Comment 10•7 years ago
The below servers were not visible in TC. After checking them, all were stuck at the grub menu. I rebooted them; they show up in TC, running and completing jobs successfully.
t-linux64-ms-272
t-linux64-ms-273
t-linux64-ms-274
t-linux64-ms-275 (need firmware upgrade bug 1464044)
t-linux64-ms-276
t-linux64-ms-277
Assignee
Comment 11•7 years ago
(In reply to Attila Craciun [:arny] from comment #9)
> (In reply to Attila Craciun [:arny] from comment #8)
> > The below Linux servers were not present in the TC list; however, I was
> > able to see each one's tasks. I have rebooted all of them and they run tasks
> > successfully.
> >
> > t-linux64-ms-007
> > t-linux64-ms-057
> > t-linux64-ms-141
> > t-linux64-ms-183
> > t-linux64-ms-189
> > t-linux64-ms-493
>
> > PXE is also not working for these machines.
I don't have PXE working yet for the mdc1 moonshots.
Assignee
Comment 12•7 years ago
(In reply to Dave House [:dhouse] from comment #11)
> (In reply to Attila Craciun [:arny] from comment #9)
> > (In reply to Attila Craciun [:arny] from comment #8)
> > > The below Linux servers were not present in the TC list; however, I was
> > > able to see each one's tasks. I have rebooted all of them and they run tasks
> > > successfully.
> > >
> > > t-linux64-ms-007
> > > t-linux64-ms-057
> > > t-linux64-ms-141
> > > t-linux64-ms-183
> > > t-linux64-ms-189
> > > t-linux64-ms-493
> >
> > PXE is also not working for these machines.
>
> I don't have PXE working yet for the mdc1 moonshots.
I changed all of the linux nodes on the mdc1 and mdc2 moonshots to boot from their local hard-disks instead of doing PXE-boot first. This should prevent machines that reboot from wasting time trying to pxe-boot.
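The change itself was made through the Moonshot chassis management tooling; purely as an illustration of the equivalent per-node setting, here is a sketch using generic IPMI. The node list, addresses and credentials are placeholders, and whether each cartridge is reachable this way is an assumption.
```
# Hypothetical: make the local disk the persistent first boot device so a
# rebooting node no longer waits on PXE. NODE_BMCS/IPMI_USER/IPMI_PASS are placeholders.
NODE_BMCS=(node1-mgmt.example node2-mgmt.example)
for bmc in "${NODE_BMCS[@]}"; do
  ipmitool -I lanplus -H "$bmc" -U "$IPMI_USER" -P "$IPMI_PASS" \
    chassis bootdev disk options=persistent
done
```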
Comment 13•7 years ago
As an update on linux moonshots:
t-linux64-ms-193 and t-linux64-ms-275 are out of service. This is also stated in the MS document.
t-linux64-ms-279 and t-linux64-ms-280 were missing from TC. I re-imaged them and the process went all the way through. Machine 279 was assigned to dividehex when it was last broken, as per bug 1435020.
t-linux64-ms-394, however, won't even go through PXE boot. It looks like it's trying, but it keeps going back to the beginning of PXE boot.
Reporter
Comment 14•7 years ago
t-linux64-ms-257 - was not present in TC; rebooted, and now it is back in business.
Assignee
Comment 15•7 years ago
I see that t-linux64-ms-394 is working: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
Danut, could you have someone on your team check over the others to see if they are in the same state now or fixed?
Comment 16•7 years ago
(In reply to Dave House [:dhouse] from comment #15)
> I see that t-linux64-ms-394 is working:
> https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/
> gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
>
> Danut, could you have someone on your team check over the others to see if
> they are in the same state now or fixed?
:dhouse, 279 and 280 are missing once again from TC. Shall we keep them in that state for further investigation, or shall we re-image them once again?
Flags: needinfo?(dhouse)
Assignee
Comment 17•7 years ago
(In reply to Roland Mutter Michael (:rmutter) from comment #16)
> (In reply to Dave House [:dhouse] from comment #15)
> > I see that t-linux64-ms-394 is working:
> > https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/
> > gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
> >
> > Danut, could you have someone on your team check over the others to see if
> > they are in the same state now or fixed?
>
> :dhouse, 279 and 280 are missing once again from TC. Shall we keep them in
> that state for further investigation, or shall we re-image them once again?
:rmutter, please re-image them once again. If this repeats, we can review the logs to see what has happened to cause them to stop taking jobs.
Flags: needinfo?(dhouse)
Assignee
Comment 18•7 years ago
Adding some new nodes seeing this problem from #ci:
> 21:32:51 <&riman|ciduty> Hello dhouse: The following t-linux64-ms-(351, 356, 357, 436) workers are missing from Taskcluster. I have tried to re-image them but they all remain stuck after F12. Could you take a look, please?
t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436
Assignee
Comment 19•7 years ago
(In reply to Dave House [:dhouse] from comment #18)
> Adding some new nodes seeing this problem from #ci:
> > 21:32:51 <&riman|ciduty> Hello dhouse: The following t-linux64-ms-(351, 356, 357, 436) workers are missing from Taskcluster. I have tried to re-image them but they all remain stuck after F12. Could you take a look, please?
>
> t-linux64-ms-351
> t-linux64-ms-356
> t-linux64-ms-357
> t-linux64-ms-436
I see the same hanging at "Booting PXE over IPv4" in mdc2 chassis 8, 9 and 11 (10, 12, 13, 14 are not having this problem). I spot-checked across all of these mdc2 chassis.
Also, the 4 above are never ping-able (and while spot-checking I found t-linux64-ms-346 had this problem 2 of 3 times I rebooted it, so I think this may be intermittent across others also). When I boot them from their local ubuntu install, they go into the "raise the network interfaces" waiting period and then give up without network (and are not pingable).
So I tried changing back to the VM admin hosts for pxeboot, and the failing machines still did not get any farther in pxeboot (so I reverted that back to the correct new admin hosts for pxe/tftp).
Assignee
Comment 20•7 years ago
Through troubleshooting in #systems, Van found that the problem chassis needed their 2nd switches restarted; See https://mana.mozilla.org/wiki/display/NETOPS/HP+Switch+Configuration#HPSwitchConfiguration-12.Troubleshooting
```
If you see the switches/chassis complaining of a duplicate IP, that means the switch may have lost its IRF config and will need to be rebooted.
ex: Duplicate address 10.51.16.34 on interface M-GigabitEthernet0/0/0, sourced from 9cb6-54fe-7cca
```
He fixed moon chassis 8, 9 and 11, and I confirmed by pxebooting two machines from each chassis.
I need to check through all of the linux cartridges on these three chassis to make sure none are left thinking that they have no network (or needing to be reimaged).
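That check could be scripted as a simple sweep; a sketch only, assuming the node names resolve from wherever it is run, and with a placeholder node list (the real list would be the linux cartridges on moon chassis 8, 9 and 11).
```
# Hypothetical sweep: flag cartridges that think they have no network.
for n in 346 351 356 357 436; do   # placeholder node numbers
  host=$(printf 't-linux64-ms-%03d.test.releng.mdc2.mozilla.com' "$n")
  if ping -c 2 -W 2 "$host" > /dev/null 2>&1; then
    echo "ok         $host"
  else
    echo "no-network $host   (candidate for console check / reimage)"
  fi
done
```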
Assignee
Comment 21•7 years ago
We also need to set up some sort of monitoring to be alerted if the switch problem happens again (since we do not know what caused it).
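One possible shape for that monitoring, based on the duplicate-address symptom quoted in comment 20: a Nagios-style check that scans recent forwarded switch/chassis logs for the error string. This is a sketch under the assumption that those logs land in a local syslog file; the real log source (papertrail, the chassis manager, etc.) would need to be wired in.
```
#!/bin/bash
# check_moonshot_dup_ip.sh - hypothetical Nagios-style check for the lost-IRF symptom.
LOG=${1:-/var/log/syslog}   # assumed location of forwarded switch/chassis logs
if tail -n 10000 "$LOG" | grep -q 'Duplicate address .* on interface M-GigabitEthernet'; then
  echo "CRITICAL: duplicate address reported by a moonshot chassis switch (IRF config may be lost)"
  exit 2
fi
echo "OK: no duplicate-address messages in recent logs"
exit 0
```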
Comment 22•7 years ago
Went for a full check of which linux moonshots appear in TC. Seems like the following machines are not in TC:
t-linux64-ms-141
t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436
Will proceed with a reboot for every machine. If that doesn't work, I'll start a reimage for each one. I'll be back with updates.
Assignee
Comment 23•7 years ago
(In reply to Roland Mutter Michael (:rmutter) from comment #22)
> Went for a full check of which linux moonshots appear in TC. Seems like
> the following machines are not in TC:
> t-linux64-ms-141
> t-linux64-ms-351
> t-linux64-ms-356
> t-linux64-ms-357
> t-linux64-ms-436
>
> Will proceed with a reboot for every machine. If that doesn't work, I'll
> start a reimage for each one. I'll be back with updates.
Thank you! I appreciate your work on these.
Comment 24•7 years ago
After rebooting the machines, the following are the candidates for reimage:
t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436
Comment 25•7 years ago
:dhouse I saw from previous shifts that reimaging for 356, 357 and 436 is disabled. Please ping us whenever they are ready for the reimage. For now, Adrian will reimage t-linux64-ms-351.
Assignee
Comment 26•7 years ago
(In reply to Roland Mutter Michael (:rmutter) from comment #25)
> :dhouse I saw from previous shifts that reimaging for 356, 357 and 436 is
> disabled. Please ping us whenever they are ready for the reimage. For now,
> Adrian will reimage t-linux64-ms-351.
Thank you. We were able to get the reimaging fixed (the network switches in moon8/9/11 had lost some config and had to be reconfigured).
I'll reimage 356, 357 and 436 to make sure that works on them.
Assignee
Comment 27•7 years ago
I started reimaging 356, 357 and 436. I'll check that they puppetize and start taking work.
Confirmed all three were not in taskcluster taking jobs before reimaging:
```
t-linux64-ms-356.test.releng.mdc2.mozilla.com https://moon-chassis-9.inband.releng.mdc2.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c11n1
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-356
t-linux64-ms-357.test.releng.mdc2.mozilla.com https://moon-chassis-9.inband.releng.mdc2.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c12n1
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-357
t-linux64-ms-436.test.releng.mdc2.mozilla.com https://moon-chassis-11.inband.releng.mdc2.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c1n1
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-436
```
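For repeated checks like the one above, the same queue endpoint can be polled in a loop; a sketch assuming the endpoint returns a non-200 status (with the "Worker not found" body shown above) for workers it does not know about.
```
# Hypothetical helper: report which of a set of workers the queue currently knows about.
QUEUE=https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers
for w in t-linux64-ms-356 t-linux64-ms-357 t-linux64-ms-436; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$QUEUE/mdc2/$w")
  if [ "$code" = "200" ]; then
    echo "present $w"
  else
    echo "missing $w (HTTP $code)"
  fi
done
```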
Assignee
Comment 28•7 years ago
356,357,436 are reimaged and now running jobs:
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-356
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-357
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-436
Comment 29•7 years ago
I've reimaged t-linux64-ms-351; it needs to be checked later to confirm it takes jobs.
Comment 30•7 years ago
(In reply to Adrian Pop from comment #29)
> I've reimaged t-linux64-ms-351; it needs to be checked later to confirm it takes jobs.
Looks good in TC: https://tools.taskcluster.net/groups/W1Mde5F9Rpm8QrPEqMl2Hg/tasks/PEvc-MzET6KLNaGyaclDWg/runs/0
Assignee
Comment 31•7 years ago
All of the machines reported in this bug are accounted for and working correctly now (279 and 280 are loaners, all others were in a good state):
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-007
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-057
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-141
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-183
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-189
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-193
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-272
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-273
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-274
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-275
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-276
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-277
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-279
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-280
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-351
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-356
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-357
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-436
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-484
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-493
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-495
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-527
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-580
The two missing from taskcluster are 279 and 280:
t-linux64-ms-279.test.releng.mdc1.mozilla.com https://moon-chassis-7.inband.releng.mdc1.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c9n1
t-linux64-ms-280.test.releng.mdc1.mozilla.com https://moon-chassis-7.inband.releng.mdc1.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c10n1
279 was repaired by re-seating the cartridge in bug 1435020 (created bug 1472727 this morning to track it as a loaner)
280 is a loaner for Dragos (see bug 1464070)
No longer blocks: t-linux64-ms-280
Status: NEW → RESOLVED
Closed: 7 years ago
Depends on: 1435020, t-linux64-ms-280
Resolution: --- → FIXED
Comment 32•7 years ago
Gonna keep tracking and updating this bug with linux machines that fail.
t-linux64-ms-527 <-- rebooted, reimaged, back in TC, waiting for jobs.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•7 years ago
Blocks: t-linux64-ms-580
Assignee
Comment 33•7 years ago
527 looks good. 580 is a dev machine.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-527
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Updated•6 years ago
Blocks: t-linux64-ms-193
Updated•6 years ago
Blocks: t-linux64-ms-275
Comment 34•6 years ago
I've re-imaged a large number of linux moonshot machines, which apparently all failed in under 12h:
linux-{001, 003, 005, 008, 011, 014, 015, 047, 049, 051, 054, 056, 058, 092, 094, 097, 098, 100, 101, 103, 104, 136, 137, 140, 141, 144, 146, 148, 149, 181, 183, 185, 187, 188, 192, 193, 195, 226, 227, 232, 235, 237, 239, 271, 272, 273, 275, 276, 277, 279, 346, 353, 538}
Dave, could this be related to the firmware upgrade you brought to the moonshots? Also had considerably more W10 workers to deal with today, compared to last 2 weeks.
Status: RESOLVED → REOPENED
Flags: needinfo?(dhouse)
Resolution: FIXED → ---
Comment 35•6 years ago
(In reply to Zsolt Fay [:zsoltfay] from comment #34)
> I've re-imaged a large number of linux moonshot machines, which apparently
> all failed in under 12h:
>
> linux-{001, 003, 005, 008, 011, 014, 015, 047, 049, 051, 054, 056, 058, 092,
> 094, 097, 098, 100, 101, 103, 104, 136, 137, 140, 141, 144, 146, 148, 149,
> 181, 183, 185, 187, 188, 192, 193, 195, 226, 227, 232, 235, 237, 239, 271,
> 272, 273, 275, 276, 277, 279, 346, 353, 538}
>
> Dave, could this be related to the firmware upgrade you brought to the
> moonshots? Also had considerably more W10 workers to deal with today,
> compared to last 2 weeks.
In the last 48h, 79 linux moonshots have been re-imaged but have not recovered from that state.
At this point I'm afraid that the deployment process is bad, since some of the machines have been re-imaged at least once, most of them 2-3 times (tracking for that can be found in the following doc: https://docs.google.com/spreadsheets/d/1A6fU2t3rVY2oAd-U26k4lPZGjfULnh6w5XySqsMofUM), and they still appear in a bad state.
That said, the deploy process itself looks normal, with no obvious reason why they fail to take tasks.
I'll continue investigating the issue and report back if I find something obvious.
Comment 36•6 years ago
Later update: after looking into services, the first 3 machines I checked have this process (PID 719) running:
```
[root@t-linux64-ms-005 ~]# ps -ef | grep puppet
root       719   714  0 03:09 ?        00:00:00 /bin/bash /root/puppetize.sh
root      5428  5416  0 17:22 pts/0    00:00:00 grep --color=auto puppet
[root@t-linux64-ms-005 ~]#
```
Doing
> cat last_run_report.yaml |grep fail
we got
> status: failed
Also:
```
[root@t-linux64-ms-005 state]# cat /var/lib/puppet/state/last_run_summary.yaml
---
version:
  config: remotes/origin/HEAD
  puppet: "3.8.5"
resources:
  changed: 4
  failed: 1
  failed_to_restart: 0
  out_of_sync: 5
  restarted: 0
  scheduled: 0
  skipped: 1
  total: 456
time:
  anchor: 0.004462988
  augeas: 0.291225276
  config_retrieval: 14.035952311998699
  exec: 0.374154974
  file: 1.0897829879999996
  filebucket: 8.704e-05
  firewall: 0.009833629000000002
  firewallchain: 0.001217386
  group: 0.000156733
  host: 0.000386595
  package: 4.367681283000001
  resources: 0.000127482
  schedule: 0.00052634
  service: 0.8251894579999998
  sysctl: 0.000188186
  total: 21.0020116199987
  user: 0.0010389499999999999
  last_run: 1537144918
changes:
  total: 4
events:
  failure: 1
  success: 4
  total: 5
[root@t-linux64-ms-005 state]#
```
and from papertrail we got this
> message: "change from stopped to running failed: Could not start Service[mig-agent]: Execution of '/bin/systemctl start mig-agent' returned 5: Failed to start mig-agent.service: Unit mig-agent.service not found."
and puppetize.log contains a lot of:
> Running puppet agent against server 'puppet'
> Puppet run failed; re-trying after 10m
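Since last_run_summary.yaml only records counts, one quick way to triage this across many hosts is to pull the failed-resource count from each node; a sketch only, assuming root SSH access from an admin host, a placeholder host list, and the two-space YAML indentation shown above.
```
# Hypothetical triage loop: report puppet failed-resource counts across nodes.
for h in t-linux64-ms-001 t-linux64-ms-003 t-linux64-ms-005; do   # placeholder list
  fqdn="$h.test.releng.mdc1.mozilla.com"
  # "  failed:" matches the count under "resources:" in last_run_summary.yaml
  count=$(ssh -o ConnectTimeout=5 "root@$fqdn" \
    "awk '/^  failed:/ {print \$2}' /var/lib/puppet/state/last_run_summary.yaml" 2>/dev/null)
  echo "$h failed resources: ${count:-unknown (unreachable or no summary)}"
done
```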
I also started the puppet service on the first machine (t-linux64-ms-001), looked into papertrail, and found this:
> Sep 16 18:04:59 t-linux64-ms-001.test.releng.mdc1.mozilla.com puppet-agent: (/File[/var/lib/puppet/lib]) Could not evaluate: Could not retrieve file metadata for puppet://releng-puppet2.srv.releng.scl3.mozilla.com/plugins: Failed to open TCP connection to releng-puppet2.srv.releng.scl3.mozilla.com:8140 (Connection timed out - connect(2) for "releng-puppet2.srv.releng.scl3.mozilla.com" port 8140)
Is that server being shut down?
bcrisan@bcrisan-P6198:~$ fping releng-puppet2.srv.releng.scl3.mozilla.com
releng-puppet2.srv.releng.scl3.mozilla.com is unreachable
Comment 37•6 years ago
(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #36)
> Is that server being shut down?
> bcrisan@bcrisan-P6198:~$ fping releng-puppet2.srv.releng.scl3.mozilla.com
> releng-puppet2.srv.releng.scl3.mozilla.com is unreachable
I'm going to answer that: yes, it is down, and the mdc1 puppet server should probably be used because the workers are in MDC1.
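To repoint an agent without a full reimage, the server setting in its puppet.conf could be switched to an mdc1 master; this is only a sketch, and the mdc1 master hostname below is a placeholder, not a value confirmed in this bug.
```
# Hypothetical: see which master the agent currently uses, then switch it.
puppet config print server
# The mdc1 master name below is a placeholder, not confirmed in this bug.
sed -i 's/^server *=.*/server = releng-puppet1.srv.releng.mdc1.mozilla.com/' /etc/puppet/puppet.conf
puppet agent --test
```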
Comment 38•6 years ago
I've tried to reboot the following; all of them were powered off. After powering them on, the machines restarted a few times without successfully booting the OS. After a few restarts, all of them ended up in the powered-off state again:
t-linux64-ms-272
t-linux64-ms-273
t-linux64-ms-276
t-linux64-ms-277
Assignee
Comment 39•6 years ago
We have the moonshots configured to power off after 3 failed boots.
These four {272,273,276,277} have been restarted or started since then and are working correctly:
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw/workers/mdc1/t-linux64-ms-272
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw/workers/mdc1/t-linux64-ms-273
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw/workers/mdc1/t-linux64-ms-276
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw/workers/mdc1/t-linux64-ms-277
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Flags: needinfo?(dhouse)
Resolution: --- → FIXED
Comment 40•6 years ago
(In reply to Dave House [:dhouse] from comment #39)
> We have the moonshots configured to power off after 3 failed boots.
If we're going to stick with that, then we should be getting alerts when that happens, either from nagios or iLO.
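A minimal form of that alert could be an out-of-band power-state poll from the nagios side; a sketch under the assumption that the cartridges' power state is queryable over IPMI (the address and credentials are placeholders).
```
#!/bin/bash
# check_moonshot_power.sh - hypothetical Nagios-style check that a cartridge is powered on.
BMC_HOST=$1   # placeholder: node/chassis management address
STATE=$(ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power status 2>/dev/null)
if [ "$STATE" = "Chassis Power is on" ]; then
  echo "OK: $BMC_HOST is powered on"
  exit 0
fi
echo "CRITICAL: $BMC_HOST power state is '${STATE:-unknown}' (possibly powered off after 3 failed boots)"
exit 2
```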