Closed Bug 636051 Opened 14 years ago Closed 14 years ago

DHCP Requests failing in SCL1

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86_64
Linux
task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bkero, Assigned: bkero)

Details

(Whiteboard: [slaveduty][buildslaves])

Fedora 32/64 DHCP requests have been failing. I've monitored the DHCP logs on ns1/ns2, and see successful requests going through. Could someone please update this ticket with more observed symptoms?
Correction: the DHCP servers are serving requests to other hosts; however, there are no recent log entries showing Fedora 32/64 hosts requesting anything.
So far, I've seen failures from nagios on:
  talos-r3-fed-044
  talos-r3-fed-039
  talos-r3-fed64-029
  talos-r3-fed-004
  talos-r3-fed64-025
  talos-r3-fed-009
  talos-r3-fed-007
  talos-r3-fed-042
  talos-r3-fed-005
  talos-r3-fed64-038
  talos-r3-fed-047
  talos-r3-fed64-012
  talos-r3-fed64-020
  talos-r3-fed64-027
  talos-r3-fed64-011
  talos-r3-fed64-033
  talos-r3-fed64-003
  talos-r3-fed64-005
  talos-r3-fed64-004
  talos-r3-fed64-039

but successful reboots of:
  talos-r3-fed64-030
  talos-r3-leopard-053
  talos-r3-fed64-009
  talos-r3-w7-044
  talos-r3-w7-012
  talos-r3-fed64-014

There doesn't seem to be an IP-related pattern:

BAD:
talos-r3-fed-044.build.scl1.mozilla.com has address 10.12.49.171
talos-r3-fed-039.build.scl1.mozilla.com has address 10.12.49.166
talos-r3-fed64-029.build.scl1.mozilla.com has address 10.12.49.209
talos-r3-fed-004.build.scl1.mozilla.com has address 10.12.49.131
talos-r3-fed64-025.build.scl1.mozilla.com has address 10.12.49.205
talos-r3-fed-009.build.scl1.mozilla.com has address 10.12.49.136
talos-r3-fed-007.build.scl1.mozilla.com has address 10.12.49.134
talos-r3-fed-042.build.scl1.mozilla.com has address 10.12.49.169
talos-r3-fed-005.build.scl1.mozilla.com has address 10.12.49.132
talos-r3-fed64-038.build.scl1.mozilla.com has address 10.12.49.218
talos-r3-fed-047.build.scl1.mozilla.com has address 10.12.49.174
talos-r3-fed64-012.build.scl1.mozilla.com has address 10.12.49.192
talos-r3-fed64-020.build.scl1.mozilla.com has address 10.12.49.200
talos-r3-fed64-027.build.scl1.mozilla.com has address 10.12.49.207
talos-r3-fed64-011.build.scl1.mozilla.com has address 10.12.49.191
talos-r3-fed64-033.build.scl1.mozilla.com has address 10.12.49.213
talos-r3-fed64-003.build.scl1.mozilla.com has address 10.12.49.183
talos-r3-fed64-005.build.scl1.mozilla.com has address 10.12.49.185
talos-r3-fed64-004.build.scl1.mozilla.com has address 10.12.49.184
talos-r3-fed64-039.build.scl1.mozilla.com has address 10.12.49.219

GOOD:
talos-r3-fed64-030.build.scl1.mozilla.com has address 10.12.49.210
talos-r3-leopard-053.build.scl1.mozilla.com has address 10.12.50.53
talos-r3-fed64-009.build.scl1.mozilla.com has address 10.12.49.189
talos-r3-w7-044.build.scl1.mozilla.com has address 10.12.50.204
talos-r3-w7-012.build.scl1.mozilla.com has address 10.12.50.173
talos-r3-fed64-014.build.scl1.mozilla.com has address 10.12.49.194
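One way to double-check the "no IP-related pattern" observation is to bucket the BAD and GOOD addresses by network. This is only a rough sketch: the addresses are copied from the lookup output in this comment, but the /24 grouping is an assumption (the real netmask in scl1 is not stated in this bug), and `subnet_counts` is a hypothetical helper name.

```python
import ipaddress
from collections import Counter

# Addresses copied from the BAD/GOOD lookup output in this comment.
BAD = [
    "10.12.49.171", "10.12.49.166", "10.12.49.209", "10.12.49.131",
    "10.12.49.205", "10.12.49.136", "10.12.49.134", "10.12.49.169",
    "10.12.49.132", "10.12.49.218", "10.12.49.174", "10.12.49.192",
    "10.12.49.200", "10.12.49.207", "10.12.49.191", "10.12.49.213",
    "10.12.49.183", "10.12.49.185", "10.12.49.184", "10.12.49.219",
]
GOOD = [
    "10.12.49.210", "10.12.50.53", "10.12.49.189",
    "10.12.50.204", "10.12.50.173", "10.12.49.194",
]


def subnet_counts(addresses, prefix=24):
    """Count addresses per network; the /24 prefix is an assumption."""
    counts = Counter()
    for addr in addresses:
        # strict=False lets us pass a host address and get its network back.
        net = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
        counts[str(net)] += 1
    return dict(counts)


bad_counts = subnet_counts(BAD)
good_counts = subnet_counts(GOOD)
print("BAD: ", bad_counts)
print("GOOD:", good_counts)
```

Under the /24 assumption, every failing host falls in 10.12.49.0/24, but so do three of the hosts that rebooted successfully (all fed64), so the address range alone does not separate the two groups.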
Not sure if this is caused by the same DNS bustage that started at 15:54 PST. Raising to blocker because of the impact: builds/tests are burning on the tree, where developers are waiting for green builds to "go" for beta12. From IRC just now with dustin and bkero: this is still a problem and is still being investigated.
Severity: major → blocker
Whiteboard: [slaveduty][buildslaves]
Assignee: server-ops → bkero
Rebooting machines and collecting the time of failure as I go:

talos-r3-fed-044: 15:22
talos-r3-fed-039: 15:22
talos-r3-fed64-029: 15:24
talos-r3-fed-004: 15:22
talos-r3-fed64-025: 15:28
talos-r3-fed-009: 15:26
talos-r3-fed-007: 15:27
talos-r3-fed-042: 15:28
talos-r3-fed-005: 15:29
talos-r3-fed64-038: 15:30
talos-r3-fed-047: 15:34
talos-r3-fed64-012: 15:34
talos-r3-fed64-020: 15:39
talos-r3-fed64-027: 15:37
talos-r3-fed64-011: 15:37
talos-r3-fed64-033: 15:41
talos-r3-fed64-003: Has a lease?
talos-r3-fed64-005: 15:46
talos-r3-fed64-004: 15:47
talos-r3-fed64-039: 15:49

Also, from dustin:

talos-r3-fed64-008: 15:43
talos-r3-fed64-037: 15:43
talos-r3-fed64-054: 15:34

Short version is that these machines all rebooted, sent some discovers, and then dhclient just exited.

Long version is:

Feb 22 15:22:01 talos-r3-fed-044 NetworkManager: <info> dhclient started with pid 1108
Feb 22 15:22:01 talos-r3-fed-044 dhclient: Internet Systems Consortium DHCP Client 4.1.0p1
Feb 22 15:22:01 talos-r3-fed-044 dhclient: Copyright 2004-2009 Internet Systems Consortium.
Feb 22 15:22:01 talos-r3-fed-044 dhclient: All rights reserved.
Feb 22 15:22:01 talos-r3-fed-044 dhclient: For info, please visit http://www.isc.org/sw/dhcp/
Feb 22 15:22:01 talos-r3-fed-044 dhclient:
Feb 22 15:22:01 talos-r3-fed-044 dhclient: Listening on LPF/eth1/34:15:9e:18:d7:98
Feb 22 15:22:01 talos-r3-fed-044 dhclient: Sending on LPF/eth1/34:15:9e:18:d7:98
Feb 22 15:22:01 talos-r3-fed-044 dhclient: Sending on Socket/fallback
Feb 22 15:22:01 talos-r3-fed-044 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 4
Feb 22 15:22:05 talos-r3-fed-044 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 7
Feb 22 15:22:12 talos-r3-fed-044 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 14
Feb 22 15:22:26 talos-r3-fed-044 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 12
Feb 22 15:22:38 talos-r3-fed-044 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 13
Feb 22 15:22:47 talos-r3-fed-044 NetworkManager: <info> (eth1): canceled DHCP transaction, dhcp client pid 1108
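To spot the same symptom on other slaves without reading each log by hand, something like the following could scan syslog text for hosts that sent DHCPDISCOVERs, never logged a DHCPOFFER, and then had NetworkManager cancel the transaction. It is a minimal sketch, assuming the syslog line layout shown in this comment; `summarize` and `hosts_without_offer` are hypothetical helper names, and the sample text is an excerpt of the log above.

```python
import re

# Excerpt of the talos-r3-fed-044 log quoted in this comment.
LOG = """\
Feb 22 15:22:01 talos-r3-fed-044 NetworkManager: <info> dhclient started with pid 1108
Feb 22 15:22:01 talos-r3-fed-044 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 4
Feb 22 15:22:05 talos-r3-fed-044 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 7
Feb 22 15:22:12 talos-r3-fed-044 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 14
Feb 22 15:22:26 talos-r3-fed-044 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 12
Feb 22 15:22:38 talos-r3-fed-044 dhclient: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 13
Feb 22 15:22:47 talos-r3-fed-044 NetworkManager: <info> (eth1): canceled DHCP transaction, dhcp client pid 1108
"""

# timestamp, host, program, message (layout assumed from the lines above)
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) (\S+?): (.*)$")


def summarize(log_text):
    """Return per-host counts of discovers/offers and whether DHCP was canceled."""
    summary = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        _timestamp, host, _program, message = match.groups()
        entry = summary.setdefault(
            host, {"discovers": 0, "offers": 0, "canceled": False}
        )
        if message.startswith("DHCPDISCOVER"):
            entry["discovers"] += 1
        elif message.startswith("DHCPOFFER"):
            entry["offers"] += 1
        elif "canceled DHCP transaction" in message:
            entry["canceled"] = True
    return summary


def hosts_without_offer(summary):
    """Hosts that sent discovers, saw no offer, and were canceled."""
    return sorted(
        host
        for host, entry in summary.items()
        if entry["discovers"] > 0 and entry["offers"] == 0 and entry["canceled"]
    )


result = summarize(LOG)
print(result)
print(hosts_without_offer(result))
```

For the excerpt above this reports five discovers, no offers, and a canceled transaction for talos-r3-fed-044, which matches the "sent some discovers, and then dhclient just exited" summary.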
zandr: thanks for being willing to drive out to scl and take care of this so quickly! I will apply a resiliency fix in bug 636069, so unless there's further research to do on the DHCP server side, I think this is done.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard