gitweb3.dmz.scl3.mozilla.com having network connectivity issues

RESOLVED FIXED

Status

Infrastructure & Operations
NetOps: Other
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: dgarvey, Assigned: dcurado)

(Reporter)

Description

3 years ago
Receiving multiple socket timeouts from nagios to nrpe.

Restarted nrpe.

[dgarvey@gitweb3.dmz.scl3 ~]$ sudo /etc/init.d/nrpe restart
Shutting down nrpe:                                        [  OK  ]
Starting nrpe:                                             [  OK  ]
[dgarvey@gitweb3.dmz.scl3 ~]$
(Reporter)

Updated

3 years ago
Assignee: infra → dgarvey
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Updated

3 years ago
Summary: gitweb3.dmz.scl3.mozilla.com having network conductivity issues with nagios → gitweb3.dmz.scl3.mozilla.com having network connecivity issues with nagios

Updated

3 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Comment 1

3 years ago
gitweb3.dmz.scl3.mozilla.com is now intermittently failing to respond to ping.
It seems to drop batches of ICMP packets, and ssh connections hang during the same periods.

64 bytes from gitweb3.dmz.scl3.mozilla.com (10.22.74.41): icmp_seq=484 ttl=63 time=0.772 ms
64 bytes from gitweb3.dmz.scl3.mozilla.com (10.22.74.41): icmp_seq=485 ttl=63 time=0.848 ms
64 bytes from gitweb3.dmz.scl3.mozilla.com (10.22.74.41): icmp_seq=486 ttl=63 time=0.939 ms

64 bytes from gitweb3.dmz.scl3.mozilla.com (10.22.74.41): icmp_seq=519 ttl=63 time=0.868 ms
64 bytes from gitweb3.dmz.scl3.mozilla.com (10.22.74.41): icmp_seq=520 ttl=63 time=0.951 ms
64 bytes from gitweb3.dmz.scl3.mozilla.com (10.22.74.41): icmp_seq=521 ttl=63 time=0.984 ms
(that was from admin1a.private.scl3, so this isn't just a nagios issue)
Summary: gitweb3.dmz.scl3.mozilla.com having network connecivity issues with nagios → gitweb3.dmz.scl3.mozilla.com having network connectivity issues
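
The "batches of ICMP" being dropped show up as jumps in icmp_seq in the output above (486 to 519). A minimal sketch for pulling those gaps out of saved ping output, assuming Linux-style reply lines; the function name `find_seq_gaps` is made up for illustration and is not part of any tool used in this bug:

```python
import re

# Matches the sequence number in lines like "... icmp_seq=486 ttl=63 time=0.939 ms"
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def find_seq_gaps(ping_output):
    """Return (last_seen, next_seen, missing_count) for each jump in icmp_seq."""
    seqs = [int(m.group(1)) for m in SEQ_RE.finditer(ping_output)]
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            gaps.append((prev, cur, cur - prev - 1))
    return gaps


# Two reply lines copied from the output above, on either side of the drop.
sample = """\
64 bytes from gitweb3.dmz.scl3.mozilla.com (10.22.74.41): icmp_seq=486 ttl=63 time=0.939 ms
64 bytes from gitweb3.dmz.scl3.mozilla.com (10.22.74.41): icmp_seq=519 ttl=63 time=0.868 ms
"""
print(find_seq_gaps(sample))  # [(486, 519, 32)]
```

That is 32 consecutive replies missing in a single batch.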

Updated

3 years ago
Component: Infrastructure: Other → MOC: Problems
QA Contact: jdow → lypulong
Whenever traffic to gitweb3.dmz.scl3 drops out, I also see a nagios check against geoip2.webapp.scl3.mozilla.com fail, and it recovers at the same time.
Summary: gitweb3.dmz.scl3.mozilla.com having network connectivity issues → gitweb3.dmz.scl3.mozilla.com and other hosts having network connectivity issues
--- gitweb3.dmz.scl3.mozilla.com ping statistics ---
286 packets transmitted, 185 received, 35% packet loss, time 285095ms
rtt min/avg/max/mdev = 0.779/0.905/1.117/0.060 ms
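
For comparing runs before and after a fix, the loss figure can be recomputed from the summary line. A rough sketch, assuming the summary formats seen in this bug ("185 received" as well as "100000 packets received"); `packet_loss` is a hypothetical helper name:

```python
import re

# Accepts both "N received" and "N packets received" summary styles.
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def packet_loss(summary_line):
    """Return (lost_packets, loss_percent) parsed from a ping summary line."""
    m = SUMMARY_RE.search(summary_line)
    if m is None:
        raise ValueError("not a ping summary line")
    sent, received = int(m.group(1)), int(m.group(2))
    lost = sent - received
    return lost, 100.0 * lost / sent


lost, pct = packet_loss(
    "286 packets transmitted, 185 received, 35% packet loss, time 285095ms"
)
print(lost, f"{pct:.1f}")  # 101 35.3
```

101 of 286 packets lost is consistent with the 35% that ping reported.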
12:45 <@dcurado:#netops> yeah, known issue
12:45 <@dcurado:#netops> that started on saturday during all the fun we had
12:46 <@dcurado:#netops> and we all decided we'd double back to it at the start of the week

Updated

3 years ago
Assignee: dgarvey → network-operations
Component: MOC: Problems → NetOps: Other
QA Contact: lypulong → jbarnell

Updated

3 years ago
See Also: → bug 1199982
(Assignee)

Updated

3 years ago
Assignee: network-operations → dcurado
Status: REOPENED → ASSIGNED
Logs are full of:

    scsi host1: BC_434 : Link Down on Port 0
    scsi host2: BC_434 : Link Down on Port 1
    bonding: bond0: link status definitely down for interface eth0, disabling it
    bonding: bond0: now running without any active interface !
    bonding: bond0: link status definitely down for interface eth1, disabling it
    scsi host1: BC_446 : Link UP on Port 0
    scsi host2: BC_446 : Link UP on Port 1
    bond0: link status definitely up for interface eth0, 1000 Mbps full duplex.
    bonding: bond0: making interface eth0 the new active one.
    bonding: bond0: first active interface up!
    bond0: link status definitely up for interface eth1, 1000 Mbps full duplex.
    scsi host1: BC_434 : Link Down on Port 0
    scsi host2: BC_434 : Link Down on Port 1
    bonding: bond0: link status definitely down for interface eth0, disabling it
    bonding: bond0: now running without any active interface !
    bonding: bond0: link status definitely down for interface eth1, disabling it
    scsi host1: BC_446 : Link UP on Port 0
    scsi host2: BC_446 : Link UP on Port 1
    bond0: link status definitely up for interface eth0, 1000 Mbps full duplex.
    bonding: bond0: making interface eth0 the new active one.
    bonding: bond0: first active interface up!
    bond0: link status definitely up for interface eth1, 1000 Mbps full duplex.
    scsi host2: BC_434 : Link Down on Port 1

Anything useful on the switch side?
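
To put a number on the flapping, the bonding messages can be tallied per interface. A minimal sketch, assuming the kernel message wording shown in the log above; `summarize_flaps` is a hypothetical helper, not an existing tool:

```python
import re
from collections import Counter

# Wording taken from the bonding messages in the log excerpt above.
DOWN_RE = re.compile(r"link status definitely down for interface (\w+)")
OUTAGE_MARKER = "now running without any active interface"


def summarize_flaps(log_text):
    """Count link-down events per interface and total-outage events for the bond."""
    downs = Counter()
    outages = 0
    for line in log_text.splitlines():
        m = DOWN_RE.search(line)
        if m:
            downs[m.group(1)] += 1
        elif OUTAGE_MARKER in line:
            outages += 1
    return downs, outages


# First five lines of the log excerpt above.
sample_log = """\
scsi host1: BC_434 : Link Down on Port 0
scsi host2: BC_434 : Link Down on Port 1
bonding: bond0: link status definitely down for interface eth0, disabling it
bonding: bond0: now running without any active interface !
bonding: bond0: link status definitely down for interface eth1, disabling it
"""
downs, outages = summarize_flaps(sample_log)
print(dict(downs), outages)  # {'eth0': 1, 'eth1': 1} 1
```

Each "now running without any active interface" line marks a moment when both bonded interfaces were down at once, which would line up with the ping and ssh dropouts.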
According to :fubar, gitweb3 is not used in prod at the moment.
As such, downtiming the alerts for 24h.

Updated

3 years ago
Summary: gitweb3.dmz.scl3.mozilla.com and other hosts having network connectivity issues → gitweb3.dmz.scl3.mozilla.com having network connectivity issues
At :johnb's suggestion, bounced the network from the console.

> 10:40:12 < johnb> looks better to me so far
> 10:40:21 < johnb> no loss for 10k pings

Comment 10

3 years ago
--- 10.22.74.41 ping statistics ---
100000 packets transmitted, 100000 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.389/3.392/73.014/1.325 ms
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED