Closed Bug 1464064 Opened 7 years ago Closed 6 years ago
Moonshot Linux nodes stop functioning
Categories: Infrastructure & Operations :: RelOps: General (task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: arny, Assigned: dhouse)
We have found that the below Linux servers are not visible in TC and are not taking jobs. We will re-image them and update this bug.
t-linux64-ms-193
t-linux64-ms-279
t-linux64-ms-280
t-linux64-ms-484
t-linux64-ms-495
t-linux64-ms-527
t-linux64-ms-580
Comment 1•7 years ago
Can we get a link to the papertrail logs for some of these? This sounds suspiciously like what we're seeing on the w10 nodes, where they suddenly stop working.
Updated•7 years ago
Blocks: t-linux64-ms-280
Good results for t-linux64-ms-495 (https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-495). I think we need to do the same reimage, then watch and possibly cold-reboot if the first boot after reimage gets stuck.
I'll check 279 and 280 next (they get stuck at the pxeboot menu).
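For reference, a cold reboot of a cartridge that hangs on its first boot after reimage can be issued out-of-band. This is only a sketch using standard IPMI commands; the management address and credentials below are placeholders, and the actual workflow here may go through the chassis manager instead.
```
# Hypothetical out-of-band power cycle for a node stuck on first boot after reimage.
# BMC_HOST, IPMI_USER and IPMI_PASS are placeholders, not values from this bug.
BMC_HOST="node-management-address.example"
ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power status
ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power cycle
```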
(In reply to Kendall Libby [:fubar] from comment #1)
> Can we get a link to the papertrail logs for some of these? This sounds
> suspiciously like what we're seeing on the w10 nodes, where they suddenly
> stop working.
Here is all of them in papertrail: https://papertrailapp.com/groups/6937292/events?q=t-linux64-ms-193%20OR%20%20t-linux64-ms-279%20OR%20%20t-linux64-ms-280%20OR%20%20t-linux64-ms-484%20OR%20%20t-linux64-ms-495%20OR%20%20t-linux64-ms-527%20OR%20%20t-linux64-ms-580&focus=936334690644828165
Not much for the ones that are stuck at pxeboot, however. But ones like #495 show logs once they are repaired: https://papertrailapp.com/systems/1899813261/events?focus=936334699754856501
There is a tracking bug for hardware issues on the moonshots:
https://bugzilla.mozilla.org/show_bug.cgi?id=1428159
If we find a hardware issue with any of these we can address it through that bug.
See Also: → 1428159
stuck on pxeboot:
t-linux64-ms-193 (after failing pxeboot, goes into xen currently)
t-linux64-ms-279 (after failing pxeboot, goes into ubuntu but without tc-worker running)
t-linux64-ms-280 (after failing pxeboot, goes into ubuntu but without tc-worker running)
fixed by reimage:
t-linux64-ms-484
t-linux64-ms-495
t-linux64-ms-527
Okay: 571-580 are not in production; they are a development set.
t-linux64-ms-580 (we expect this to be off or not running tc-worker)
t-linux64-ms-488 also came up as not running the tc worker. This one needs to be reimaged as it has a problem with its puppet certificate and cannot update its puppet config.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-488
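For what it's worth, the usual alternative to a reimage for a broken agent certificate on Puppet 3.x would be to revoke and re-issue it. This is only a sketch: the master hostname is not given in this bug, the FQDN assumes the same naming pattern as the other mdc2 nodes, and reimaging may well be the simpler path here.
```
# Hypothetical cert re-issue instead of a reimage (Puppet 3.x paths).
# On the puppet master (hostname not specified in this bug):
puppet cert clean t-linux64-ms-488.test.releng.mdc2.mozilla.com

# On the agent: discard the old SSL data and request a fresh certificate.
rm -rf /var/lib/puppet/ssl
puppet agent --test --waitforcert 60
```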
Comment 7•7 years ago
I've done the reimage on t-linux64-ms-488. It seems like it appears in TC and is taking tasks. We still cannot SSH into it.
Reporter
Comment 8•7 years ago
The below Linux servers were not present in the TC list; however, I was able to see each one's tasks. I have rebooted all of them and they run tasks successfully.
t-linux64-ms-007
t-linux64-ms-057
t-linux64-ms-141
t-linux64-ms-183
t-linux64-ms-189
t-linux64-ms-493
Reporter
Comment 9•7 years ago
(In reply to Attila Craciun [:arny] from comment #8)
> The below Linux servers were not present in the TC list; however, I was
> able to see each one's tasks. I have rebooted all of them and they run tasks
> successfully.
>
> t-linux64-ms-007
> t-linux64-ms-057
> t-linux64-ms-141
> t-linux64-ms-183
> t-linux64-ms-189
> t-linux64-ms-493
PXE is also not working for these machines.
Reporter
Comment 10•7 years ago
The below servers were not visible in TC. After checking them, all were stuck at the grub menu. I rebooted them; they show up in TC, running and completing jobs successfully.
t-linux64-ms-272
t-linux64-ms-273
t-linux64-ms-274
t-linux64-ms-275 (need firmware upgrade bug 1464044)
t-linux64-ms-276
t-linux64-ms-277
Assignee
Comment 11•7 years ago
(In reply to Attila Craciun [:arny] from comment #9)
> (In reply to Attila Craciun [:arny] from comment #8)
> > The below Linux servers were not present in the TC list; however, I was
> > able to see each one's tasks. I have rebooted all of them and they run tasks
> > successfully.
> >
> > t-linux64-ms-007
> > t-linux64-ms-057
> > t-linux64-ms-141
> > t-linux64-ms-183
> > t-linux64-ms-189
> > t-linux64-ms-493
>
> > PXE is also not working for these machines.
I don't have PXE working yet for the mdc1 moonshots.
Assignee
Comment 12•7 years ago
(In reply to Dave House [:dhouse] from comment #11)
> (In reply to Attila Craciun [:arny] from comment #9)
> > (In reply to Attila Craciun [:arny] from comment #8)
> > > The below Linux servers were not present in the TC list; however, I was
> > > able to see each one's tasks. I have rebooted all of them and they run tasks
> > > successfully.
> > >
> > > t-linux64-ms-007
> > > t-linux64-ms-057
> > > t-linux64-ms-141
> > > t-linux64-ms-183
> > > t-linux64-ms-189
> > > t-linux64-ms-493
> >
> > PXE is also not working for these machines.
>
> I don't have PXE working yet for the mdc1 moonshots.
I changed all of the linux nodes on the mdc1 and mdc2 moonshots to boot from their local hard-disks instead of doing PXE-boot first. This should prevent machines that reboot from wasting time trying to pxe-boot.
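The change itself was made through the Moonshot chassis management tooling; purely as an illustration of the equivalent per-node setting, here is a sketch using generic IPMI. The node list, addresses and credentials are placeholders, and whether each cartridge is reachable this way is an assumption.
```
# Hypothetical: make the local disk the persistent first boot device so a
# rebooting node no longer waits on PXE. NODE_BMCS/IPMI_USER/IPMI_PASS are placeholders.
NODE_BMCS=(node1-mgmt.example node2-mgmt.example)
for bmc in "${NODE_BMCS[@]}"; do
  ipmitool -I lanplus -H "$bmc" -U "$IPMI_USER" -P "$IPMI_PASS" \
    chassis bootdev disk options=persistent
done
```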
Comment 13•7 years ago
As an update on linux moonshots:
t-linux64-ms-193 and t-linux64-ms-275 are out of service. This is also stated in the MS document.
t-linux64-ms-279 and t-linux64-ms-280 were missing from TC. I re-imaged them and the process went all the way through. Machine 279 was assigned to dividehex when it was last broken, as per bug 1435020.
t-linux64-ms-394, however, won't even go through PXE boot. It looks like it's trying, but it keeps going back to the beginning of PXE boot.
Reporter
Comment 14•7 years ago
t-linux64-ms-257 - was not present in TC; rebooted, and now it is back in business.
Assignee
Comment 15•7 years ago
I see that t-linux64-ms-394 is working: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
Danut, could you have someone on your team check over the others to see if they are in the same state now or fixed?
Comment 16•7 years ago
(In reply to Dave House [:dhouse] from comment #15)
> I see that t-linux64-ms-394 is working:
> https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/
> gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
>
> Danut, could you have someone on your team check over the others to see if
> they are in the same state now or fixed?
:dhouse, 279 and 280 are missing once again from TC. Shall we keep them in that state for further investigation, or shall we re-image them once again?
Flags: needinfo?(dhouse)
Assignee
Comment 17•7 years ago
(In reply to Roland Mutter Michael (:rmutter) from comment #16)
> (In reply to Dave House [:dhouse] from comment #15)
> > I see that t-linux64-ms-394 is working:
> > https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/
> > gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
> >
> > Danut, could you have someone on your team check over the others to see if
> > they are in the same state now or fixed?
>
> :dhouse, 279 and 280 are missing once again from TC. Shall we keep them in
> that state for further investigation, or shall we re-image them once again?
:rmutter, please re-image them once again. If this repeats, we can review the logs to see what has happened to cause them to stop taking jobs.
Flags: needinfo?(dhouse)
Assignee
Comment 18•7 years ago
Adding some new nodes seeing this problem from #ci:
> 21:32:51 <&riman|ciduty> Hello dhouse: The following t-linux64-ms-(351, 356, 357, 436) workers are missing from Taskcluster. I have tried to re-image them but they all remain stuck after F12. Could you take a look, please?
t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436
Assignee
Comment 19•7 years ago
(In reply to Dave House [:dhouse] from comment #18)
> Adding some new nodes seeing this problem from #ci:
> > 21:32:51 <&riman|ciduty> Hello dhouse: The following t-linux64-ms-(351, 356, 357, 436) workers are missing from Taskcluster. I have tried to re-image them but they all remain stuck after F12. Could you take a look, please?
>
> t-linux64-ms-351
> t-linux64-ms-356
> t-linux64-ms-357
> t-linux64-ms-436
I see the same hanging at "Booting PXE over IPv4" in mdc2 chassis 8, 9 and 11 (10, 12, 13, 14 are not having this problem). I spot-checked across all of these mdc2 chassis.
Also, the 4 above are never ping-able (and while spot-checking I found t-linux64-ms-346 had this problem 2 of 3 times I rebooted it, so I think this may be intermittent across others also). When I boot them from their local ubuntu install, they go into the "raise the network interfaces" waiting period and then give up without network (and are not pingable).
So I tried changing back to the VM admin hosts for pxeboot, and the failing machines still did not get any farther in pxeboot (so I reverted that back to the correct new admin hosts for pxe/tftp).
Assignee
Comment 20•7 years ago
Through troubleshooting in #systems, Van found that the problem chassis needed their 2nd switches restarted; See https://mana.mozilla.org/wiki/display/NETOPS/HP+Switch+Configuration#HPSwitchConfiguration-12.Troubleshooting
```
If you see the switches/chassis complaining of a duplicate IP, that means the switch may have lost its IRF config and will need to be rebooted.
ex: Duplicate address 10.51.16.34 on interface M-GigabitEthernet0/0/0, sourced from 9cb6-54fe-7cca
```
He fixed moon chassis 8, 9 and 11, and I confirmed by pxebooting two machines from each chassis.
I need to check through all of the linux cartridges on these three chassis to make sure none are left thinking that they have no network (or needing to be reimaged).
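That check could be scripted as a simple sweep; a sketch only, assuming the node names resolve from wherever it is run, and with a placeholder node list (the real list would be the linux cartridges on moon chassis 8, 9 and 11).
```
# Hypothetical sweep: flag cartridges that think they have no network.
for n in 346 351 356 357 436; do   # placeholder node numbers
  host=$(printf 't-linux64-ms-%03d.test.releng.mdc2.mozilla.com' "$n")
  if ping -c 2 -W 2 "$host" > /dev/null 2>&1; then
    echo "ok         $host"
  else
    echo "no-network $host   (candidate for console check / reimage)"
  fi
done
```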
Assignee
Comment 21•7 years ago
We also need to set up some sort of monitoring to be alerted if the switch problem happens again (since we do not know what caused it).
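One possible shape for that monitoring, based on the duplicate-address symptom quoted in comment 20: a Nagios-style check that scans recent forwarded switch/chassis logs for the error string. This is a sketch under the assumption that those logs land in a local syslog file; the real log source (papertrail, the chassis manager, etc.) would need to be wired in.
```
#!/bin/bash
# check_moonshot_dup_ip.sh - hypothetical Nagios-style check for the lost-IRF symptom.
LOG=${1:-/var/log/syslog}   # assumed location of forwarded switch/chassis logs
if tail -n 10000 "$LOG" | grep -q 'Duplicate address .* on interface M-GigabitEthernet'; then
  echo "CRITICAL: duplicate address reported by a moonshot chassis switch (IRF config may be lost)"
  exit 2
fi
echo "OK: no duplicate-address messages in recent logs"
exit 0
```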
Comment 22•7 years ago
Went for a full check of which linux moonshots appear in TC. Seems like the following machines are not in TC:
t-linux64-ms-141
t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436
Will proceed with a reboot for every machine. If that doesn't work, I'll start a reimage for each one. I'll be back with updates.
Assignee
Comment 23•7 years ago
(In reply to Roland Mutter Michael (:rmutter) from comment #22)
> Went for a full check of which linux moonshots appear in TC. Seems like
> the following machines are not in TC:
> t-linux64-ms-141
> t-linux64-ms-351
> t-linux64-ms-356
> t-linux64-ms-357
> t-linux64-ms-436
>
> Will proceed with a reboot for every machine. If that doesn't work, I'll
> start a reimage for each one. I'll be back with updates.
Thank you! I appreciate your work on these.
Comment 24•7 years ago
After rebooting the machines, the following are the candidates for reimage:
t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436
Comment 25•7 years ago
:dhouse I saw from previous shifts that reimaging for 356, 357 and 436 is disabled. Please ping us whenever they are ready for the reimage. For now, Adrian will reimage t-linux64-ms-351.
Assignee
Comment 26•7 years ago
(In reply to Roland Mutter Michael (:rmutter) from comment #25)
> :dhouse I saw from previous shifts that reimaging for 356, 357 and 436 is
> disabled. Please ping us whenever they are ready for the reimage. For now,
> Adrian will reimage t-linux64-ms-351.
Thank you. We were able to get the reimaging fixed (the network switches in moon8/9/11 had lost some config and had to be reconfigured).
I'll reimage 356, 357 and 436 to make sure that works on them.
Assignee
Comment 27•7 years ago
I started reimaging 356, 357 and 436. I'll check that they puppetize and start taking work.
Confirmed all three were not in taskcluster taking jobs before reimaging:
```
t-linux64-ms-356.test.releng.mdc2.mozilla.com https://moon-chassis-9.inband.releng.mdc2.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c11n1
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-356
t-linux64-ms-357.test.releng.mdc2.mozilla.com https://moon-chassis-9.inband.releng.mdc2.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c12n1
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-357
t-linux64-ms-436.test.releng.mdc2.mozilla.com https://moon-chassis-11.inband.releng.mdc2.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c1n1
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-436
```
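For repeated checks like the one above, the same queue endpoint can be polled in a loop; a sketch assuming the endpoint returns a non-200 status (with the "Worker not found" body shown above) for workers it does not know about.
```
# Hypothetical helper: report which of a set of workers the queue currently knows about.
QUEUE=https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers
for w in t-linux64-ms-356 t-linux64-ms-357 t-linux64-ms-436; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$QUEUE/mdc2/$w")
  if [ "$code" = "200" ]; then
    echo "present $w"
  else
    echo "missing $w (HTTP $code)"
  fi
done
```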
Assignee
Comment 28•7 years ago
356,357,436 are reimaged and now running jobs:
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-356
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-357
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-436
Comment 29•7 years ago
I've reimaged t-linux64-ms-351; it needs to be checked later to confirm it takes jobs.
Comment 30•7 years ago
(In reply to Adrian Pop from comment #29)
> I've reimaged t-linux64-ms-351; it needs to be checked later to confirm it takes jobs.
Looks good in TC: https://tools.taskcluster.net/groups/W1Mde5F9Rpm8QrPEqMl2Hg/tasks/PEvc-MzET6KLNaGyaclDWg/runs/0
Assignee
Comment 31•7 years ago
All of the machines reported in this bug are accounted for and working correctly now (279 and 280 are loaners, all others were in a good state):
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-007
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-057
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-141
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-183
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-189
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-193
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-272
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-273
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-274
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-275
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-276
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-277
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-279
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-280
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-351
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-356
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-357
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-436
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-484
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-493
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-495
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-527
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-580
The two missing from taskcluster are 279 and 280:
t-linux64-ms-279.test.releng.mdc1.mozilla.com https://moon-chassis-7.inband.releng.mdc1.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c9n1
t-linux64-ms-280.test.releng.mdc1.mozilla.com https://moon-chassis-7.inband.releng.mdc1.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c10n1
279 was repaired by re-seating the cartridge in bug 1435020 (created bug 1472727 this morning to track it as a loaner)
280 is a loaner for Dragos (see bug 1464070)
No longer blocks: t-linux64-ms-280
Status: NEW → RESOLVED
Closed: 7 years ago
Depends on: 1435020, t-linux64-ms-280
Resolution: --- → FIXED
Comment 32•7 years ago
Gonna keep tracking and updating this bug with linux machines that fail.
t-linux64-ms-527 <-- rebooted, reimaged, back in TC, waiting for jobs.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•7 years ago
Blocks: t-linux64-ms-580
Assignee
Comment 33•7 years ago
527 looks good. 580 is a dev machine.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-527
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Updated•6 years ago
Blocks: t-linux64-ms-193
Updated•6 years ago
Blocks: t-linux64-ms-275
Comment 34•6 years ago
I've re-imaged a large number of linux moonshot machines, which apparently all failed in under 12h:
linux-{001, 003, 005, 008, 011, 014, 015, 047, 049, 051, 054, 056, 058, 092, 094, 097, 098, 100, 101, 103, 104, 136, 137, 140, 141, 144, 146, 148, 149, 181, 183, 185, 187, 188, 192, 193, 195, 226, 227, 232, 235, 237, 239, 271, 272, 273, 275, 276, 277, 279, 346, 353, 538}
Dave, could this be related to the firmware upgrade you brought to the moonshots? Also had considerably more W10 workers to deal with today, compared to last 2 weeks.
Status: RESOLVED → REOPENED
Flags: needinfo?(dhouse)
Resolution: FIXED → ---
Comment 35•6 years ago
(In reply to Zsolt Fay [:zsoltfay] from comment #34)
> I've re-imaged a large number of linux moonshot machines, which apparently
> all failed in under 12h:
>
> linux-{001, 003, 005, 008, 011, 014, 015, 047, 049, 051, 054, 056, 058, 092,
> 094, 097, 098, 100, 101, 103, 104, 136, 137, 140, 141, 144, 146, 148, 149,
> 181, 183, 185, 187, 188, 192, 193, 195, 226, 227, 232, 235, 237, 239, 271,
> 272, 273, 275, 276, 277, 279, 346, 353, 538}
>
> Dave, could this be related to the firmware upgrade you brought to the
> moonshots? Also had considerably more W10 workers to deal with today,
> compared to last 2 weeks.
In the last 48h, 79 linux moonshots have been re-imaged but have not recovered from that state.
At this point I'm afraid that the deployment process is bad, since some of the machines have been re-imaged at least once, most of them 2-3 times (tracking for that can be found in the following doc: https://docs.google.com/spreadsheets/d/1A6fU2t3rVY2oAd-U26k4lPZGjfULnh6w5XySqsMofUM), and they still appear in a bad state.
That said, the deploy process itself looks normal, with no obvious reason why they fail to take tasks.
I'll continue investigating the issue and report back if I find something obvious.
Comment 36•6 years ago
Later update: after looking into services, the first 3 machines I checked have this process (PID 719) running:
```
[root@t-linux64-ms-005 ~]# ps -ef | grep puppet
root       719   714  0 03:09 ?        00:00:00 /bin/bash /root/puppetize.sh
root      5428  5416  0 17:22 pts/0    00:00:00 grep --color=auto puppet
[root@t-linux64-ms-005 ~]#
```
Doing
> cat last_run_report.yaml |grep fail
we got
> status: failed
Also:
```
[root@t-linux64-ms-005 state]# cat /var/lib/puppet/state/last_run_summary.yaml
---
version:
  config: remotes/origin/HEAD
  puppet: "3.8.5"
resources:
  changed: 4
  failed: 1
  failed_to_restart: 0
  out_of_sync: 5
  restarted: 0
  scheduled: 0
  skipped: 1
  total: 456
time:
  anchor: 0.004462988
  augeas: 0.291225276
  config_retrieval: 14.035952311998699
  exec: 0.374154974
  file: 1.0897829879999996
  filebucket: 8.704e-05
  firewall: 0.009833629000000002
  firewallchain: 0.001217386
  group: 0.000156733
  host: 0.000386595
  package: 4.367681283000001
  resources: 0.000127482
  schedule: 0.00052634
  service: 0.8251894579999998
  sysctl: 0.000188186
  total: 21.0020116199987
  user: 0.0010389499999999999
  last_run: 1537144918
changes:
  total: 4
events:
  failure: 1
  success: 4
  total: 5
[root@t-linux64-ms-005 state]#
```
and from papertrail we got this
> message: "change from stopped to running failed: Could not start Service[mig-agent]: Execution of '/bin/systemctl start mig-agent' returned 5: Failed to start mig-agent.service: Unit mig-agent.service not found."
and puppetize.log contains a lot of:
> Running puppet agent against server 'puppet'
> Puppet run failed; re-trying after 10m
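Since last_run_summary.yaml only records counts, one quick way to triage this across many hosts is to pull the failed-resource count from each node; a sketch only, assuming root SSH access from an admin host, a placeholder host list, and the two-space YAML indentation shown above.
```
# Hypothetical triage loop: report puppet failed-resource counts across nodes.
for h in t-linux64-ms-001 t-linux64-ms-003 t-linux64-ms-005; do   # placeholder list
  fqdn="$h.test.releng.mdc1.mozilla.com"
  # "  failed:" matches the count under "resources:" in last_run_summary.yaml
  count=$(ssh -o ConnectTimeout=5 "root@$fqdn" \
    "awk '/^  failed:/ {print \$2}' /var/lib/puppet/state/last_run_summary.yaml" 2>/dev/null)
  echo "$h failed resources: ${count:-unknown (unreachable or no summary)}"
done
```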
I also started the puppet service on the first machine (t-linux64-ms-001), looked into papertrail, and found this:
> Sep 16 18:04:59 t-linux64-ms-001.test.releng.mdc1.mozilla.com puppet-agent: (/File[/var/lib/puppet/lib]) Could not evaluate: Could not retrieve file metadata for puppet://releng-puppet2.srv.releng.scl3.mozilla.com/plugins: Failed to open TCP connection to releng-puppet2.srv.releng.scl3.mozilla.com:8140 (Connection timed out - connect(2) for "releng-puppet2.srv.releng.scl3.mozilla.com" port 8140)
Is that server being shut down?
bcrisan@bcrisan-P6198:~$ fping releng-puppet2.srv.releng.scl3.mozilla.com
releng-puppet2.srv.releng.scl3.mozilla.com is unreachable
Comment 37•6 years ago
(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #36)
> Is that server being shut down?
> bcrisan@bcrisan-P6198:~$ fping releng-puppet2.srv.releng.scl3.mozilla.com
> releng-puppet2.srv.releng.scl3.mozilla.com is unreachable
I'm going to answer that: yes, it is down, and the mdc1 puppet server should probably be used because the workers are in MDC1.
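To repoint an agent without a full reimage, the server setting in its puppet.conf could be switched to an mdc1 master; this is only a sketch, and the mdc1 master hostname below is a placeholder, not a value confirmed in this bug.
```
# Hypothetical: see which master the agent currently uses, then switch it.
puppet config print server
# The mdc1 master name below is a placeholder, not confirmed in this bug.
sed -i 's/^server *=.*/server = releng-puppet1.srv.releng.mdc1.mozilla.com/' /etc/puppet/puppet.conf
puppet agent --test
```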
Comment 38•6 years ago
I've tried to reboot the following; all of them were powered off. After powering them on, the machines restarted a few times without successfully booting the OS. After a few restarts, all of them ended up in the powered-off state again:
t-linux64-ms-272
t-linux64-ms-273
t-linux64-ms-276
t-linux64-ms-277
Assignee
Comment 39•6 years ago
We have the moonshots configured to power off after 3 failed boots.
These four {272,273,276,277} have been restarted or started since then and are working correctly:
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw/workers/mdc1/t-linux64-ms-272
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw/workers/mdc1/t-linux64-ms-273
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw/workers/mdc1/t-linux64-ms-276
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-tw/workers/mdc1/t-linux64-ms-277
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Flags: needinfo?(dhouse)
Resolution: --- → FIXED
Comment 40•6 years ago
(In reply to Dave House [:dhouse] from comment #39)
> We have the moonshots configured to power off after 3 failed boots.
If we're going to stick with that, then we should be getting alerts when that happens, either from nagios or iLO.
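A minimal form of that alert could be an out-of-band power-state poll from the nagios side; a sketch under the assumption that the cartridges' power state is queryable over IPMI (the address and credentials are placeholders).
```
#!/bin/bash
# check_moonshot_power.sh - hypothetical Nagios-style check that a cartridge is powered on.
BMC_HOST=$1   # placeholder: node/chassis management address
STATE=$(ipmitool -I lanplus -H "$BMC_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power status 2>/dev/null)
if [ "$STATE" = "Chassis Power is on" ]; then
  echo "OK: $BMC_HOST is powered on"
  exit 0
fi
echo "CRITICAL: $BMC_HOST power state is '${STATE:-unknown}' (possibly powered off after 3 failed boots)"
exit 2
```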