Closed
Bug 1409349
Opened 8 years ago
Closed 7 years ago
More machines from the t-w1064, t-w864, t-w732 and t-yosemite pools are unreachable
Categories
(Infrastructure & Operations :: DCOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: aobreja, Assigned: van)
Attachments
(1 file)
1.40 KB, patch | dhouse: review+ | dividehex: checked-in+
Please check the list below; all of these machines are unreachable:
-t-yosemite-r7-0229
-t-w1064-ix-139
-t-w1064-ix-138
-t-yosemite-r7-0279
-t-yosemite-r7-0225
-t-yosemite-r7-0225
-t-yosemite-r7-0137
-t-yosemite-r7-0130
-t-yosemite-r7-0068
-t-yosemite-r7-0048
-t-yosemite-r7-0045
-t-w1064-ix-117
-t-w1064-ix-312
-t-w1064-ix-313
-t-w732-ix-131
-t-w864-ix-037
-t-w732-ix-130
-t-w732-ix-107
-t-w732-ix-120
-t-w732-ix-105
-t-w732-ix-096
-t-w732-ix-081
-t-w732-ix-047
-t-w732-ix-056
-t-w732-ix-054
-t-w732-ix-041
-t-w732-ix-037
-t-w732-ix-011
-t-w732-ix-031
-t-w732-ix-033
-t-w732-ix-030
-t-w732-ix-022
-t-w732-ix-016
>Attempting SSH reboot...Failed.
>Attempting IPMI reboot...Failed.
>Machine is unreachable, manual intervention require
Reporter
Updated • 8 years ago
Blocks: t-w1064-ix-312, t-w1064-ix-117, t-yosemite-r7-0045, t-yosemite-r7-0048, t-yosemite-r7-0068, t-yosemite-r7-0130, t-yosemite-r7-0137, t-yosemite-r7-0219, t-yosemite-r7-0225, t-w732-ix-016, t-w732-ix-022, t-w732-ix-030, t-w732-ix-033, t-w732-ix-031, t-w732-ix-037, t-w732-ix-041, t-w732-ix-054, t-w732-ix-056, t-w732-ix-047, t-w732-ix-081, t-w732-ix-096, t-w732-ix-105, t-w732-ix-107, t-w732-ix-130, t-w864-ix-037, 1409315, t-w732-ix-131
Reporter
Comment 2 • 8 years ago
Also the list below:
-t-w732-ix-104
-t-w732-ix-027
-t-w732-ix-003
-t-w732-ix-020
-t-w732-ix-024
-t-w732-ix-040
-t-w732-ix-073
-t-w732-ix-087
-t-w732-ix-086
-t-w732-ix-091
-t-w732-ix-098
-t-w732-ix-065
-t-w732-ix-106
-t-w732-ix-122
-t-w732-ix-141
-t-w732-ix-111
Reporter
Updated • 8 years ago
Blocks: t-yosemite-r7-0229
Reporter
Updated • 8 years ago
Blocks: t-yosemite-r7-0279
Reporter
Comment 3 • 8 years ago
Also:
-t-yosemite-r7-0110
-t-yosemite-r7-0108
Blocks: t-yosemite-r7-0108, t-yosemite-r7-0110
Comment 4 • 7 years ago
fubar, arr proposed that we stop filing dcops bugs for machines that are no longer reachable (i.e. that buildduty can't recover). Do you have context here? One of our primary concerns is keeping low pool counts from getting lower, e.g. yosemite and xp, since they are our latest pending bottlenecks[1]. Perhaps we should be making exceptions? Any thoughts from you or anyone watching this component?
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1409439
Flags: needinfo?(klibby)
Comment 5 • 7 years ago
I don't have context, unfortunately. I'd like to understand why machines are going unreachable; that seems like something is very broken.
OTOH, I expect that response to this bug, at the very least, will be delayed by :van being in MDC1 setting up the minis from move train #2.
Flags: needinfo?(klibby)
Comment 6 • 7 years ago
okay thanks. It usually is a sign they are very broken, yes. Unfortunately though, this is normal and an essential escalation step in recovering the hardware machines we have.
Thank you for letting us know about expected delays, buildduty can act appropriately. If anyone with context on dcops and the data center migration could give us an estimated timeline for actioning this bug, that would be great.
Assignee
Comment 7 • 7 years ago
oh wow, there's about 30 machines just in this bug. i'll be on site today and perhaps tomorrow to catch up on these bugs.
Assignee
Updated • 7 years ago
Assignee: server-ops-dcops → vle
Thank you Van! If you need anything tested or confirmed on the systems or in the logs after boot/image on these, please ping me. I'd like to help if I can.
Comment 9 • 7 years ago
:van, can you put a spider kvm on one of those yosemite hosts for me so I can try to determine why its network is going unresponsive? (Assuming they aren't completely hung up.)
Assignee
Comment 12 • 7 years ago
:dividehex, i've attached a spider to one of the minis - 10.26.52.254.
going through the list, but so far all the minis i've come across are running into the local fw issue. i'm reimaging them and will have a more comprehensive update.
in comment 0, you have -t-yosemite-r7-0225 listed twice, did you mean another host is down?
See Also: → 1401601
Assignee
Comment 13 • 7 years ago
:dividehex/markco, it looks like the win7 testers are also running into an APIPA issue. i've had to reimage quite a few today and will continue troubleshooting the rest of the win7 nodes tomorrow. do you want me to attach a kvm or are you able to use ipmi's console redirection?
Decommission:
-t-w1064-ix-138 - these 4 are same chassis, bad backplane, out of warranty
-t-w1064-ix-139 - these 4 are same chassis, bad backplane, out of warranty
-t-w1064-ix-312 - these 4 are same chassis, bad backplane, out of warranty
-t-w1064-ix-313 - these 4 are same chassis, bad backplane, out of warranty
back online:
-t-yosemite-r7-0229 - local fw issue, attached spider kvm for troubleshooting
-t-yosemite-r7-0279 - local fw issue, reimaged
-t-yosemite-r7-0225 - local fw issue, reimaged
-t-yosemite-r7-0137 - local fw issue, reimaged
-t-yosemite-r7-0130 - local fw issue, reimaged
-t-yosemite-r7-0068 - local fw issue, reimaged
-t-yosemite-r7-0048 - local fw issue, reimaged
-t-yosemite-r7-0045 - local fw issue, reimaged
-t-yosemite-r7-0110 - local fw issue, reimaged
-t-yosemite-r7-0108 - local fw issue, reimaged
-t-w1064-ix-117 - back online
-t-w732-ix-131 - private IP addressing, reimaged
-t-w864-ix-037 - back online
-t-w732-ix-130 - private IP addressing, reimaged
pending:
-t-yosemite-r7-402 - MDC1 node, tracked in bug 1409281, opened QTS REQ0194461
still need to troubleshoot the following nodes tomorrow:
-t-w732-ix-107
-t-w732-ix-120
-t-w732-ix-105
-t-w732-ix-096
-t-w732-ix-081
-t-w732-ix-047
-t-w732-ix-056
-t-w732-ix-054
-t-w732-ix-041
-t-w732-ix-037
-t-w732-ix-011
-t-w732-ix-031
-t-w732-ix-033
-t-w732-ix-030
-t-w732-ix-022
-t-w732-ix-016
-t-w732-ix-104
-t-w732-ix-027
-t-w732-ix-003
-t-w732-ix-020
-t-w732-ix-024
-t-w732-ix-040
-t-w732-ix-073
-t-w732-ix-087
-t-w732-ix-086
-t-w732-ix-091
-t-w732-ix-098
-t-w732-ix-065
-t-w732-ix-106
-t-w732-ix-122
-t-w732-ix-141
-t-w732-ix-111
Flags: needinfo?(jwatkins)
Assignee
Updated • 7 years ago
Flags: needinfo?(mcornmesser)
Comment 14 • 7 years ago
Van: Can you change the BIOS graphics card priority on 3 of the w732 machines? Once that is done I can connect through IPMI and do some troubleshooting.
Flags: needinfo?(mcornmesser)
Comment 15 • 7 years ago
Van, could you check the minis #219 and #120 in scl3, and #444 in mdc1? I think that duplicate entry for t-yosemite-r7-225 was meant for t-yosemite-r7-219
SCL3:
https://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/status.cgi?hostgroup=t-yosemite-r7-machines&style=detail&limit=100&sorttype=1&sortoption=6&sorttype=2&sortoption=6
Shows these three as down for 10h+:
(expected, on kvm) t-yosemite-r7-0229.test.releng.scl3.mozilla.com 20d 22h 6m 52s 30/30 PING CRITICAL - Packet loss = 100%
t-yosemite-r7-0219.test.releng.scl3.mozilla.com 11d 1h 5m 10s 30/30 PING CRITICAL - Packet loss = 100%
t-yosemite-r7-0120.test.releng.scl3.mozilla.com 0d 15h 0m 48s 30/30 PING CRITICAL - Packet loss = 100%
MDC1:
https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?hostgroup=t-yosemite-r7-machines&style=detail&limit=100&sorttype=1&sortoption=6&sorttype=2&sortoption=6
t-yosemite-r7-444.test.releng.mdc1.mozilla.com 2d 5h 22m 19s 3/3 PING CRITICAL - Packet loss = 100%
(loaner, I think expected) t-yosemite-r7-393.test.releng.mdc1.mozilla.com
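The "down for 10h+" filter applied to the Nagios lines above can be sketched in Python. This is only an illustration that assumes the status lines keep the `Nd Nh Nm Ns` duration layout shown in this comment; `parse_downtime` and `down_longer_than` are hypothetical helper names, not part of Nagios.

```python
import re
from datetime import timedelta

# Matches the duration column as it appears in the pasted status lines,
# e.g. "20d 22h 6m 52s". The layout is assumed from the lines in this comment.
_DURATION = re.compile(r"(\d+)d\s+(\d+)h\s+(\d+)m\s+(\d+)s")


def parse_downtime(line: str) -> timedelta:
    """Extract the downtime duration from one pasted status line."""
    match = _DURATION.search(line)
    if match is None:
        raise ValueError(f"no duration found in: {line!r}")
    days, hours, minutes, seconds = map(int, match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def down_longer_than(lines, threshold: timedelta):
    """Return the lines whose downtime exceeds the threshold."""
    return [line for line in lines if parse_downtime(line) > threshold]


status_lines = [
    "t-yosemite-r7-0219.test.releng.scl3.mozilla.com 11d 1h 5m 10s 30/30 PING CRITICAL - Packet loss = 100%",
    "t-yosemite-r7-0120.test.releng.scl3.mozilla.com 0d 15h 0m 48s 30/30 PING CRITICAL - Packet loss = 100%",
]
long_down = down_longer_than(status_lines, timedelta(hours=10))
```

Lines without a duration (such as the loaner entry) raise `ValueError` here, so a caller would need to skip or handle them.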
Details
These are the mac minis that were down yesterday:
12 minis were non-responsive. Nagios showed 11 with 2 to 19 days unresponsive (plus the one in mdc1)
t-yosemite-r7-0045 reported 10/17. hw reboot fixed 08/02, 07/21, 06/30, 02/13, 2016: 12/21(found off), 09/26, 07/13, 07/05
t-yosemite-r7-0048 reported 10/17
t-yosemite-r7-0068 10/17. hardware reboot fixed 08/08, 08/02, 06/06, 04/24, 02/27, 02/22
t-yosemite-r7-0108 10/17. hardware reboot fixed 08/07, 08/02
t-yosemite-r7-0110 10/17
t-yosemite-r7-0130 10/17. reimage fixed 09/26 (host-based firewall?), hw reboot fixed 08/07, 03/06, 02/08
t-yosemite-r7-0137 10/17. earlier reports cancelled 06/26, 12/22
t-yosemite-r7-0219 10/17 (linked as a bug but not listed in a comment)
t-yosemite-r7-0225 10/17. reimage fixed 02/06
t-yosemite-r7-0229 10/17. reimage fixed 09/28 (host-based firewall?), reimaged 09/07, cancelled 08/10, hw reboot fixed 08/02, 02/22, 01/17, 2016: 12/15, 11/28, hw reboot fixed 05/18, loaner 05/12-05/18
t-yosemite-r7-0279 10/17. hardware reboot fixed 04/24
t-yosemite-r7-402 reported 10/16 (found working)
t-yosemite-r7-444 reported 10/18 (linked as a bug but not listed in a comment)
Flags: needinfo?(vle)
Assignee
Comment 16 • 7 years ago
:dhouse, can you open a separate bug for MDC1 minis? it makes it much easier to track since they are different data centers. i'm on site and will continue troubleshooting.
Flags: needinfo?(vle)
Comment 17 • 7 years ago
(In reply to Van Le [:van] from comment #13)
> pending:
> -t-yosemite-r7-402 - MDC1 node, tracked in bug 1409281, opened QTS REQ0194461
>
Thx Van! The only problem MDC1 mini is covered already in bug https://bugzilla.mozilla.org/show_bug.cgi?id=1409743 with QTS REQ0194461
So that doesn't need to be handled in this bug.
Assignee
Comment 18 • 7 years ago
took care of all the yosemite and w7 left in the bug. can you open new bugs for any future bad hosts? the "one bug to track them all" makes it hard to chase down/look up their past issues.
:markco, i changed the video output on 3 hosts as requested.
t-yosemite-r7-0219 - local fw issue, reimaged
t-yosemite-r7-0120 - local fw issue, reimaged
t-w732-ix-107 - APIPA, rebooted and came back online
t-w732-ix-120 - APIPA, rebooted with no luck, changed video output for mark to troubleshoot
t-w732-ix-105 - APIPA, rebooted with no luck, changed video output for mark to troubleshoot
t-w732-ix-096 - APIPA, rebooted with no luck, changed video output for mark to troubleshoot
t-w732-ix-081 - APIPA, rebooted with no luck, reimaged
t-w732-ix-047 - APIPA, rebooted with no luck, reimaged
t-w732-ix-056 - APIPA, rebooted with no luck, reimaged
t-w732-ix-054 - APIPA, rebooted with no luck, reimaged
t-w732-ix-041 - bad drive, reimaged
t-w732-ix-037 - APIPA, rebooted with no luck, reimaged
t-w732-ix-011 - APIPA, rebooted with no luck, reimaged
t-w732-ix-031 - APIPA, rebooted with no luck, reimaged
t-w732-ix-033 - APIPA, rebooted with no luck, reimaged
t-w732-ix-030 - APIPA, rebooted with no luck, reimaged
t-w732-ix-022 - bad drive, reimaged
t-w732-ix-016 - APIPA, rebooted with no luck, reimaged
t-w732-ix-104 - APIPA, rebooted with no luck, reimaged
t-w732-ix-027 - APIPA, rebooted with no luck, reimaged
t-w732-ix-003 - APIPA, rebooted with no luck, reimaged
t-w732-ix-020 - APIPA, rebooted with no luck, reimaged
t-w732-ix-024 - APIPA, rebooted with no luck, reimaged
t-w732-ix-040 - APIPA, rebooted with no luck, reimaged
t-w732-ix-073 - APIPA, rebooted with no luck, reimaged
t-w732-ix-087 - APIPA, rebooted with no luck, reimaged
t-w732-ix-086 - APIPA, rebooted with no luck, reimaged
t-w732-ix-091 - bad drive, reimaged
t-w732-ix-098 - APIPA, rebooted with no luck, reimaged
t-w732-ix-065 - APIPA, rebooted with no luck, reimaged
t-w732-ix-106 - APIPA, rebooted with no luck, reimaged
t-w732-ix-122 - APIPA, rebooted with no luck, reimaged
t-w732-ix-141 - APIPA, rebooted with no luck, reimaged
t-w732-ix-111 - bad drive, reimaged
Flags: needinfo?(mcornmesser)
Assignee
Comment 19 • 7 years ago
to note, the bad drives in previous comment were swapped with drives from machines we decommissioned (same manufacturer/model). all of these machines are out of warranty and the drives are hit or miss with the manufacturer warranty.
Comment 20 • 7 years ago
I took a look at t-w732-ix-120. I suspect that the machines weren't able to communicate on a boot and that is what caused the issue. They were probably also looping back to themselves for DNS, which is why a reboot did not fix it.
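For anyone checking a host for this state, an APIPA address is an IPv4 link-local address in 169.254.0.0/16, which Windows assigns itself when no DHCP lease is obtained. A minimal Python sketch of that check follows; `is_apipa` is a hypothetical helper name and the sample addresses are illustrative.

```python
import ipaddress


def is_apipa(addr: str) -> bool:
    """True when addr is an IPv4 link-local (APIPA) address in 169.254.0.0/16."""
    ip = ipaddress.ip_address(addr)
    # Restrict to IPv4 so that IPv6 link-local (fe80::/10) is not reported as APIPA.
    return ip.version == 4 and ip.is_link_local


# Illustrative addresses only.
for candidate in ("169.254.10.20", "10.26.52.254"):
    print(candidate, is_apipa(candidate))
```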
Flags: needinfo?(mcornmesser)
Assignee
Comment 21 • 7 years ago
:markco, why is this happening in the first place? are we able to fix the issue or is reimaging going to be the fix for the APIPA issues?
Flags: needinfo?(mcornmesser)
Comment 22 • 7 years ago
(In reply to Van Le [:van] from comment #21)
> :markco, why is this happening in the first place? are we able to fix the
> issue or is reimaging going to be the fix for the APIPA issues?
I will dive through the logs on the local machine tomorrow to see if I can see any type of root cause.
Once the machine reaches this state, reimaging will be the most straightforward way to get it back into the pool.
Flags: needinfo?(mcornmesser)
Comment 23 • 7 years ago
I'm going to go out on a limb here and make an assumption, since I don't really have the time to pin down the cause. My best guess regarding the APIPA issue is that somehow the local firewall is interfering with the DHCP exchange. I'm not sure if it stems from the connection state table failing to keep track of the outgoing packets (such as a timeout or excessive delay between responses), or maybe it doesn't handle a new (non-renewal) lease properly, since there is no IP and the returning packet is from a unicast IP while the outgoing packet recorded in the state table was to broadcast.
I'm just not sure.
Anyway, if we make this assumption, we should be able to work around it with a more permissive firewall rule that allows DHCP regardless of state tracking.
This also applies to bug 1401601.
markco, Q: you will need to make a rule under the gpo to do the same. Basically on ingress, allow udp from any ip on source port 67 to any ip destination port 68.
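To illustrate the guess above, here is a small Python model of the two behaviours. It is a sketch under stated assumptions, not how the host firewall or the attached patch actually works: `Packet`, `stateful_reply_ok` and `stateless_dhcp_ok` are hypothetical names, the strict state match is a simplification, and the addresses are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    proto: str      # "udp" or "tcp"
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def stateful_reply_ok(outgoing: Packet, incoming: Packet) -> bool:
    """Simplified strict state match: the reply must mirror the recorded outgoing flow."""
    return (
        incoming.proto == outgoing.proto
        and incoming.src_ip == outgoing.dst_ip
        and incoming.src_port == outgoing.dst_port
        and incoming.dst_port == outgoing.src_port
    )


def stateless_dhcp_ok(incoming: Packet) -> bool:
    """Workaround rule: ingress UDP from any IP, source port 67, to any IP, destination port 68."""
    return incoming.proto == "udp" and incoming.src_port == 67 and incoming.dst_port == 68


# A client with no lease broadcasts its request; the server answers from its unicast address.
discover = Packet("udp", "0.0.0.0", 68, "255.255.255.255", 67)
offer = Packet("udp", "10.0.0.1", 67, "10.0.0.50", 68)  # hypothetical addresses

print(stateful_reply_ok(discover, offer))  # reply source is not the broadcast address
print(stateless_dhcp_ok(offer))
```

In this model the strict match rejects the offer because its source is the server's unicast address rather than the broadcast address recorded for the outgoing packet, while the port-based rule accepts it regardless of state.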
Comment 24 • 7 years ago
Flags: needinfo?(jwatkins)
Attachment #8921280 - Flags: review?(dhouse)
Attachment #8921280 - Flags: review?(dhouse) → review+
Comment 25 • 7 years ago
Comment on attachment 8921280 [details] [diff] [review]
Allow dhcp client exchange
https://hg.mozilla.org/build/puppet/rev/9b1347a173465832858425e18cf3dec896a4d940
https://hg.mozilla.org/build/puppet/rev/bdb344e213910583c09dea72b4ef412b5c60a90e
Attachment #8921280 - Flags: checked-in+
Assignee
Comment 26 • 7 years ago
closing out bug, please reopen if you need any additional hands on with any of the 50ish hosts listed in this bug.
Assignee
Updated • 7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated • 7 years ago
See Also: → t-w1064-ix-312
Updated • 7 years ago
See Also: → t-w1064-ix-139