Bug 1464070 (t-linux64-ms-280)
[MDC1] t-linux64-ms-280 problem tracking
Status: RESOLVED FIXED
Opened 7 years ago
Closed 7 years ago
Categories: Infrastructure & Operations Graveyard :: CIDuty, task
Tracking: Not tracked
People: Reporter: zfay, Assigned: dragrom
Attachments: 3 files
The machine is not showing up in taskcluster and freezes up in PXE boot.
Comment 1•7 years ago (Reporter)
t-linux64-ms-279 does the same thing after a cold boot.
On 280, the pxeboot menu never displays. Instead it tries to connect and fails over to ipv6. This text is displayed:
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4)
>> Booting PXE over IPv4.
Station IP address is 10.49.58.100
```
Other details: 280 was a loaner and was not reclaimed yet (bug 1410207). So it was working correctly but was not running taskcluster worker. However, we still need to reimage it to reclaim it from being a loaner (bug is already closed). So we need to fix the pxeboot.
279 is not running taskcluster worker. It is not disabled in the puppet node definitions, but it appears to need reimaging or to be re-puppetized because it does not have any taskcluster worker files in place (it may not be getting the correct node definition; there is no /etc/taskcluster*yaml or /usr/local/bin/run-tc-worker.sh, etc.).
However, we cannot reimage 279 because it sees the same pxeboot failure as 280.
Comment 4•7 years ago (Assignee)
(In reply to Dave House [:dhouse] from comment #2)
> On 280, the pxeboot menu never displays. Instead it tries to connect and
> fails over to ipv6. This text is displayed:
> ```
> >> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4)
>
> >> Booting PXE over IPv4.
> Station IP address is 10.49.58.100
> ```
> other details: 280 was a loaner and was not reclaimed yet (bug 1410207). So
> it was working correctly but was not running taskcluster worker. However, we
> still need to reimage it to reclaim it from being a loaner (bug is already
> closed). So we need to fix the pxeboot.
>
> 279 is not running taskcluster worker. It is not disabled in the puppet node
> definitions, but it appears to need reimaging or to be re-puppetized because
> it does not have any taskcluster worker files in place (it may not be
> getting the correct node definition; there is no /etc/taskcluster*yaml or
> /usr/local/bin/run-tc-worker.sh, etc.).
> However, we cannot reimage 279 because it sees the same pxeboot failure as 280.
Looking into nodes.pp:
# Loaner for dividehex
node 't-linux64-ms-279.test.releng.mdc1.mozilla.com' {
    $aspects = [ 'low-security' ]
    include toplevel::server
}
(In reply to Dragos Crisan [:dragrom] from comment #4)
> Looking into nodes.pp:
> # Loaner for dividehex
> node 't-linux64-ms-279.test.releng.mdc1.mozilla.com' {
>     $aspects = [ 'low-security' ]
>     include toplevel::server
> }
:) Thank you!
Mark and Jake, I'm cc'ing you on this bug where we started tracking the pxeboot failing on some of the linux moonshots:
So far, these fail to network boot (timeout):
t-linux64-ms-{193,279,280}
There are likely others that also fail, but we have not tried netbooting all of the linux nodes recently.
Comment 7•7 years ago
Node 281 (Windows/Win vlan) also hit an issue on pxe boot. It attempts to boot over IPv4, fails, and then continuously tries over IPv6. I find the proximity of 279, 280, and 281 curious.
(In reply to Mark Cornmesser [:markco] from comment #7)
> Node 281 (Windows/Win vlan) also hit an issue on pxe boot. It attempts to
> boot over IPv4, fails, and then continuously tries over IPv6. I find the
> proximity of 279, 280, and 281 curious.
+1 I'll try a network boot on the other linux nodes on moon-chassis-7 (the linux queue is empty, so I'm not concerned about it backing up if I pull a few workers out):
t-linux64-ms-{271..280}.test.releng.mdc1.mozilla.com
c1n1..c10n1
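For anyone scripting the chassis-wide check, the host range and cartridge labels above can be expanded as in this minimal sketch. It assumes the hosts map to cartridges in order (271 to c1n1 through 280 to c10n1), which is how the two lines above read; `chassis_nodes` is a hypothetical helper name, not an existing tool.

```python
def chassis_nodes(first_host, count, domain):
    """Pair cartridge labels (c1n1, c2n1, ...) with sequential worker hostnames.

    Assumes hosts are numbered consecutively and map to cartridges in order.
    """
    pairs = []
    for offset in range(count):
        cartridge = f"c{offset + 1}n1"
        host = f"t-linux64-ms-{first_host + offset}.{domain}"
        pairs.append((cartridge, host))
    return pairs


if __name__ == "__main__":
    for cartridge, host in chassis_nodes(271, 10, "test.releng.mdc1.mozilla.com"):
        print(cartridge, host)
```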
(In reply to Dave House [:dhouse] from comment #8)
> (In reply to Mark Cornmesser [:markco] from comment #7)
> > Node 281 (Windows/Win vlan) also hit an issue on pxe boot. It attempts to
> > boot over IPv4, fails, and then continuously tries over IPv6. I find the
> > proximity of 279, 280, and 281 curious.
>
> +1 I'll try a network boot on the other linux nodes on moon-chassis-7 (the
> linux queue is empty. so I'm not concerned about it backing-up if I pull a
> few workers out):
>
> t-linux64-ms-{271..280}.test.releng.mdc1.mozilla.com
> c1n1..c10n1
t-linux64-ms-271 has the same failure: it times out on ipv4 pxeboot.
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4)
>> Booting PXE over IPv4.
Station IP address is 10.49.58.92
```
I'm testing the others also {272..278}.
Maybe something like the hp uefi boot setting is not set for this chassis or set of machines.
Comment 10•7 years ago
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4)
>> Booting PXE over IPv4.
Station IP address is 10.49.58.92
Server IP address is 10.48.75.31
NBP filename is /bootx64.efi
NBP filesize is 0 Bytes
PXE-E18: Server response timeout.
```
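When comparing console captures across many nodes, it can help to pull the key fields out of the text. The following is a minimal sketch, assuming the capture looks like the output above; `parse_pxe_console` and `looks_like_tftp_timeout` are hypothetical helper names. A station IP and NBP filename together with a 0-byte NBP and PXE-E18 suggests that DHCP answered but the boot file download did not.

```python
import re

# Patterns for the fields printed by the UEFI PXE client in the captures above.
PXE_FIELDS = {
    "station_ip": re.compile(r"Station IP address is (\S+)"),
    "server_ip": re.compile(r"Server IP address is (\S+)"),
    "nbp_filename": re.compile(r"NBP filename is (\S+)"),
    "nbp_filesize": re.compile(r"NBP filesize is (\d+) Bytes"),
    "error": re.compile(r"(PXE-E\d+: .+)"),
}


def parse_pxe_console(text):
    """Return the first occurrence of each known field found in the console text."""
    fields = {}
    for line in text.splitlines():
        for name, pattern in PXE_FIELDS.items():
            match = pattern.search(line)
            if match and name not in fields:
                fields[name] = match.group(1).strip()
    if "nbp_filesize" in fields:
        fields["nbp_filesize"] = int(fields["nbp_filesize"])
    return fields


def looks_like_tftp_timeout(fields):
    """True when the boot file was offered but nothing was downloaded (0 bytes, PXE-E18)."""
    return fields.get("nbp_filesize") == 0 and fields.get("error", "").startswith("PXE-E18")
```

Feeding it the capture above yields the server IP 10.48.75.31 and the filename /bootx64.efi, which are the two values worth checking against the DHCP options.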
Comment 11•7 years ago
(In reply to Dave House [:dhouse] from comment #9)
> I'm testing the others also {272..278}.
t-linux64-ms-{271..280} all have this problem.
Comment 12•7 years ago
(In reply to Dave House [:dhouse] from comment #10)
> ```
> >> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4)
>
> >> Booting PXE over IPv4.
> Station IP address is 10.49.58.92
>
> Server IP address is 10.48.75.31
> NBP filename is /bootx64.efi
> NBP filesize is 0 Bytes
> PXE-E18: Server response timeout.
> ```
Rob, could you verify that the hp uefi and other dhcp-options are set correctly in mdc1 for the moon-chassis-7 hosts (I found the hp uefi filter in infoblox but I was not able to find where it was turned on)?
[10.49.58.91 - 10.49.58.100] (t-linux64-ms-{271..280}) and
10.49.40.182 t-w1064-ms-281.wintest.releng.mdc1.mozilla.com
We are seeing timeouts on pxeboot for these (for both linux and windows)
Above is what I see for the linux machines at pxeboot, and here is for windows on 10.49.40.182:
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4)
>> Booting PXE over IPv4.
PXE-E18: Server response timeout.
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv6)
>> Booting PXE over IPv6
PXE-E21: Remote boot cancelled.
```
Flags: needinfo?(rtucker)
Comment 13•7 years ago
Well shoot, I spot-checked nodes on moon-chassis-1 and moon-chassis-6, and I see the same pxe timeout (I verified that I get the same result when trying from the ilo ssh and java interfaces):
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4)
>> Booting PXE over IPv4.
Station IP address is 10.49.58.76
Server IP address is 10.48.75.31
NBP filename is /bootx64.efi
NBP filesize is 0 Bytes
PXE-E18: Server response timeout.
```
I tried cold-boot and a restart, and get the same pxe timeout for both.
I know that we successfully uefi/pxebooted these earlier this week (CIDuty and I have reimaged about 8 linux moonshots in mdc1). So maybe there was a change in the pxe/tftp server or network.
Comment 14•7 years ago
I haven't made any changes.
Is there anything specific you want me to confirm?
Flags: needinfo?(rtucker)
Comment 15•7 years ago
(In reply to Rob Tucker [:rtucker] from comment #14)
> I haven't made any changes.
>
> Is there anything specific you want me to confirm?
Could you show me where the hp uefi boot filter is set on the releng network in infoblox mdc1, and what other dhcp options are set for test.releng.mdc1? I tried to find them, but I only found the definition for the hp uefi filter and so I think I'm not checking the correct place.
Flags: needinfo?(rtucker)
Comment 16•7 years ago
The filter "HP - UEFI Clients 00007" is set correctly on 10.51.56.0/22 and 10.49.56.0/22 hence getting the filename of /bootx64.efi. You wouldn't get /bootx64.efi without the filter.
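A quick way to check which of the two filtered networks a given station IP falls into is Python's standard `ipaddress` module. This is only a sketch: the network list is copied from the sentence above, `matching_network` is a hypothetical helper, and a `None` result says nothing about whether some other network has its own filter.

```python
import ipaddress

# Networks named above as having the "HP - UEFI Clients 00007" filter applied.
FILTERED_NETWORKS = [
    ipaddress.ip_network("10.51.56.0/22"),
    ipaddress.ip_network("10.49.56.0/22"),
]


def matching_network(address):
    """Return the filtered network containing the address, or None if there is none."""
    ip = ipaddress.ip_address(address)
    for network in FILTERED_NETWORKS:
        if ip in network:
            return network
    return None
```

For example, the station IP 10.49.58.92 seen on the linux nodes falls inside 10.49.56.0/22, consistent with those nodes being offered /bootx64.efi.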
You can view the options by clicking the settings wheel next to the network in the IPAM browser and looking at the IPv4 DHCP Options
Is 10.48.75.31 the proper tftp server? This is the most likely issue, NOT the filter.
Flags: needinfo?(rtucker)
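To test whether a candidate tftp server answers at all, one option is to send it a read request by hand. The sketch below builds the request per RFC 1350 (opcode 1, filename, NUL, transfer mode, NUL) and waits for any reply on UDP port 69. The helper names are hypothetical, and a reply only shows that something is listening, not that the file contents are correct.

```python
import socket
import struct

TFTP_PORT = 69
OPCODE_RRQ = 1  # read request, per RFC 1350


def build_rrq(filename, mode="octet"):
    """Build a TFTP read request: 2-byte opcode, filename, NUL, mode, NUL."""
    return (
        struct.pack("!H", OPCODE_RRQ)
        + filename.encode("ascii")
        + b"\x00"
        + mode.encode("ascii")
        + b"\x00"
    )


def tftp_responds(server, filename, timeout=5.0):
    """Return True if the server sends any reply to a read request before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (server, TFTP_PORT))
        try:
            sock.recvfrom(1024)
        except OSError:
            # Covers the timeout as well as platform-specific socket errors.
            return False
        return True
```

Calling `tftp_responds("10.48.75.31", "/bootx64.efi")` from a host on the same network as the workers would be one way to reproduce the timeout outside of PXE, assuming the network path from that host is comparable.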
Comment 17•7 years ago
(In reply to Rob Tucker [:rtucker] from comment #16)
> The filter "HP - UEFI Clients 00007" is set correctly on 10.51.56.0/22 and
> 10.49.56.0/22 hence getting the filename of /bootx64.efi. You wouldn't get
> /bootx64.efi without the filter.
>
> You can view the options by clicking the settings wheel next to the network
> in the IPAM browser and looking at the IPv4 DHCP Options
>
> Is 10.48.75.31 the proper tftp server? This is the most likely issue, NOT
> the filter.
Thank you. Following your directions, I see the dhcp options.
I'll change the next server to match the change for the other use of that admin server from bug 1354300
Comment 18•7 years ago
I've changed the tftp-server for releng.mdc1 and releng.mdc2, in bug https://bugzilla.mozilla.org/show_bug.cgi?id=1464493. I don't see the change yet when I test rebooting t-linux64-ms-280. I'll try it again in the morning (maybe it takes some time to apply).
Comment 19•7 years ago
Just a note, t-linux64-ms-280 is a loaner for :dragrom
Comment 20•7 years ago
(In reply to Attila Craciun [:arny] from comment #19)
> Just a note, t-linux64-ms-280 is a loaner for :dragrom
:arny could you link the loaner bug to this bug?
Comment 21•7 years ago
I tested pxeboot again this morning, on t-linux64-ms-005 as it was not running a task, and it still gets the old tftp server:
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4)
>> Booting PXE over IPv4.
Station IP address is 10.49.58.5
Server IP address is 10.48.75.31
NBP filename is /bootx64.efi
NBP filesize is 0 Bytes
PXE-E18: Server response timeout.
```
Comment 22•7 years ago
(In reply to Dave House [:dhouse] from comment #20)
> (In reply to Attila Craciun [:arny] from comment #19)
> > Just a note, t-linux64-ms-280 is a loaner for :dragrom
>
> :arny could you link the loaner bug to this bug?
It is already set Bug 1410207.
Comment 23•7 years ago
Dave, linux-ms-193 PXE works but is centos 7.
Comment 24•7 years ago
Comment 25•7 years ago
worked with :dhouse via IRC and updated the next-server options.
Comment 26•7 years ago
> (In reply to Dave House [:dhouse] from comment #20)
> > (In reply to Attila Craciun [:arny] from comment #19)
> > > Just a note, t-linux64-ms-280 is a loaner for :dragrom
> >
> > :arny could you link the loaner bug to this bug?
>
> It is already set Bug 1410207.
Ok, since that is resolved I'll reimage 280 (now that the pxeboot is fixed) to put it back into service.
Comment 27•7 years ago
Do not re-image 280; :dragrom still needs it as a loaner :).
Comment 28•7 years ago
PXE works now; however, the Ubuntu image is broken. Tested on 001 and 193, same message whether or not I add the PUPPET_PASS option.
Comment 29•7 years ago
Dragos needs #280 as a loaner for the next two weeks to test puppet changes for bug 1465309.
Comment 30•7 years ago
(In reply to Attila Craciun [:arny] from comment #28)
> Created attachment 8982132 [details]
> Screenshot from 2018-05-31 09-43-47.png
>
> PXE works now, however, the Ubuntu image is broken. Tested on 001 and 193,
> same message whether or not I add the PUPPET_PASS option.
The pxe/netboot reimaging process is now fixed. So when we need to reimage this one it will work.
Comment 31•7 years ago
(In reply to Attila Craciun [:arny] from comment #19)
> Just a note, t-linux64-ms-280 is a loaner for :dragrom
Apart from working on that machine and changing syslog to use TCP instead of UDP, can you confirm that you have this machine as a loan?
Flags: needinfo?(dcrisan)
Comment 32•7 years ago (Assignee)
Yes, I'll need this machine as a loaner to test changes for bug 1465309.
Flags: needinfo?(dcrisan)
Comment 33•7 years ago (Reporter)
Should we leave this bug open? @dragrom, let us know when you are done with the machine.
Comment 34•7 years ago
:dragrom, are you done with the loaner t-linux64-ms-280 ?
Alias: t-linux64-ms-280
Flags: needinfo?(dcrisan)
Summary: t-linux64-ms-280 problem tracking → t-linux64-ms-280.test.releng.mdc1.mozilla.com. problem tracking
Comment 35•7 years ago (Assignee)
We can consider this loaner a staging worker, like t-yosemite-r7-380. In my opinion, we can close this bug
Flags: needinfo?(dcrisan)
Comment 36•7 years ago
Ok, let's keep this bug open as a marker that this machine is not in production (until there is a better way to track the different types).
Comment 37•7 years ago
Puppet failures were repeated today on this machine. So I stopped this machine (powered-off through ilo).
Assignee: nobody → dcrisan
Flags: needinfo?(dcrisan)
Comment 38•7 years ago (Assignee)
This is the error:
Thu Jul 05 12:20:23 -0700 2018 Puppet (err): Could not delete user cltbld: Execution of '/usr/sbin/userdel cltbld' returned 8: userdel: user cltbld is currently used by process 2882
Thu Jul 05 12:20:23 -0700 2018 /Stage[main]/Main/User[cltbld]/ensure (err): change from present to absent failed: Could not delete user cltbld: Execution of '/usr/sbin/userdel cltbld' returned 8: userdel: user cltbld is currently used by process 2882
This error is caused by the following definition from the puppet master:
node 't-linux64-ms-280.test.releng.mdc1.mozilla.com' {
    $aspects = [ 'low-security' ]
    include toplevel::server
}
After landing the patch from Bug 1473281, this error will disappear.
Flags: needinfo?(dcrisan)
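When the same failure repeats across puppet runs, the blocking user and PID can be pulled out of the report lines mechanically. This is a minimal sketch matching the `userdel` message text quoted above; `blocking_process` is a hypothetical helper, and the pattern is an assumption based only on those two log lines.

```python
import re

# Matches the userdel message embedded in the puppet error lines above.
USERDEL_BUSY = re.compile(r"userdel: user (\S+) is currently used by process (\d+)")


def blocking_process(log_line):
    """Return (user, pid) if the line reports a user held by a running process, else None."""
    match = USERDEL_BUSY.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

With the PID in hand, something like `ps -p 2882 -o user,cmd` on the worker would show what is still running as cltbld.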
Comment 39•7 years ago
I kicked off the reimage this morning, but I never saw a report from puppet. I may have typo'd the kickstart password. i'm re-trying reimaging it now.
Comment 40•7 years ago
(In reply to Dave House [:dhouse] from comment #39)
> I kicked off the reimage this morning, but I never saw a report from puppet.
> I may have typo'd the kickstart password. i'm re-trying reimaging it now.
I checked through ilo and found a kernel panic (not able to mount fs) logged to the screen (from the earlier reimage).
Updated•7 years ago
Summary: t-linux64-ms-280.test.releng.mdc1.mozilla.com. problem tracking → [MDC1] t-linux64-ms-280 problem tracking
Comment 41•7 years ago
(In reply to Dave House [:dhouse] from comment #40)
> (In reply to Dave House [:dhouse] from comment #39)
> > I kicked off the reimage this morning, but I never saw a report from puppet.
> > I may have typo'd the kickstart password. i'm re-trying reimaging it now.
>
> I checked through ilo and found a kernel panic (not able to mount fs) logged
> to the screen (from the earlier reimage).
I watched while retrying the reimage, and it hit a timeout on the initrd.gz download for initial setup. On trying again to capture a log through ilo+ssh, it did not time out, so it entered the ubuntu setup correctly.
Comment 42•7 years ago
(In reply to Dave House [:dhouse] from comment #41)
> (In reply to Dave House [:dhouse] from comment #40)
> > (In reply to Dave House [:dhouse] from comment #39)
> > > I kicked off the reimage this morning, but I never saw a report from puppet.
> > > I may have typo'd the kickstart password. i'm re-trying reimaging it now.
> >
> > I checked through ilo and found a kernel panic (not able to mount fs) logged
> > to the screen (from the earlier reimage).
>
> I watched while retrying the reimage, and it hit a timeout on the initrd.gz
> download for initial setup. On trying again to capture a log through
> ilo+ssh, it did not time out, so it entered the ubuntu setup correctly.
https://papertrailapp.com/systems/1645518191/events
280 came up correctly from reinstall+puppetize.
Comment 43•7 years ago
There were repeated puppet failures on this machine today. I've powered it off.
Dragos, please power it back on if you need to test on it tomorrow.
Flags: needinfo?(dcrisan)
Comment 44•7 years ago
Raising this bug in priority as it's a known machine with problems.
When the machine is fixed, feel free to remove the P1 from the bug.
Priority: -- → P1
Comment 45•7 years ago (Assignee)
The errors were generated by my tests to install generic-worker on Linux. Puppet is now fixed in my environment. Please restart the machine.
Flags: needinfo?(dcrisan)
Comment 47•7 years ago (Assignee)
Now, this machine is part of linux-talos staging pool: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-b
Comment 48•7 years ago (Assignee)
I'll close the bug since this machine is now part of the staging pool
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard