Closed
Bug 1021162
Opened 11 years ago
Closed 11 years ago
Increased slowness in communication between client and Puppet server
Categories
(Infrastructure & Operations Graveyard :: NetOps: Other, task)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: markco, Assigned: dcurado)
Details
Since Tuesday I have seen a significant slowdown between a host at 10.26.40.153 and the Puppet server releng-puppet2.srv.releng.scl3.mozilla.com (10.26.48.50). While Puppet runs I am seeing network I/O below 1 Mbps, and as slow as 4 Kbps. I don't know what the network I/O was previously, because the task used to finish within a reasonable time, less than 15 minutes. Since Tuesday, however, the task has been taking hours to complete.
Updated•11 years ago
Assignee: network-operations → dcurado
Comment 2•11 years ago
both these hosts are in the releng.scl3 domain.
Not much network infrastructure between them...
host -> agg switch -> core switch -> fw -> core switch -> agg switch -> host
Comment 3•11 years ago
ping times from the puppet machine look ok
[dcurado@releng-puppet2.srv.releng.scl3.mozilla.com ~]$ ping 10.26.40.1
PING 10.26.40.1 (10.26.40.1) 56(84) bytes of data.
64 bytes from 10.26.40.1: icmp_seq=1 ttl=64 time=3.48 ms
64 bytes from 10.26.40.1: icmp_seq=2 ttl=64 time=4.74 ms
64 bytes from 10.26.40.1: icmp_seq=3 ttl=64 time=1.14 ms
64 bytes from 10.26.40.1: icmp_seq=4 ttl=64 time=1.60 ms
64 bytes from 10.26.40.1: icmp_seq=5 ttl=64 time=8.44 ms
64 bytes from 10.26.40.1: icmp_seq=6 ttl=64 time=11.4 ms
64 bytes from 10.26.40.1: icmp_seq=7 ttl=64 time=5.81 ms
64 bytes from 10.26.40.1: icmp_seq=8 ttl=64 time=1.04 ms
64 bytes from 10.26.40.1: icmp_seq=9 ttl=64 time=0.954 ms
64 bytes from 10.26.40.1: icmp_seq=10 ttl=64 time=1.05 ms
64 bytes from 10.26.40.1: icmp_seq=11 ttl=64 time=0.970 ms
64 bytes from 10.26.40.1: icmp_seq=12 ttl=64 time=1.01 ms
64 bytes from 10.26.40.1: icmp_seq=13 ttl=64 time=1.01 ms
64 bytes from 10.26.40.1: icmp_seq=14 ttl=64 time=1.02 ms
64 bytes from 10.26.40.1: icmp_seq=15 ttl=64 time=1.05 ms
64 bytes from 10.26.40.1: icmp_seq=16 ttl=64 time=1.10 ms
^C
--- 10.26.40.1 ping statistics ---
16 packets transmitted, 16 received, 0% packet loss, time 15092ms
rtt min/avg/max/mdev = 0.954/2.867/11.421/3.081 ms
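For reference, that summary line can be reproduced from the per-packet times. A minimal sketch of how iputils ping derives min/avg/max/mdev (mdev being the population standard deviation of the samples — an assumption about the exact formula, but it matches observed output):

```python
import math

def rtt_stats(samples):
    # Summarize RTT samples (ms) the way iputils ping does:
    # min, avg, max, and mdev (population standard deviation).
    n = len(samples)
    avg = sum(samples) / n
    variance = max(sum(t * t for t in samples) / n - avg * avg, 0.0)
    return min(samples), avg, max(samples), math.sqrt(variance)
```

Feeding in the 16 times above lands near the reported 0.954/2.867/11.421/3.081.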
Comment 4•11 years ago
puppet host interface looks clean:
[dcurado@releng-puppet2.srv.releng.scl3.mozilla.com ~]$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:50:56:BB:26:9E
inet addr:10.26.48.50 Bcast:10.26.51.255 Mask:255.255.252.0
inet6 addr: fe80::250:56ff:febb:269e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1865239978 errors:0 dropped:0 overruns:0 frame:0
TX packets:3367512530 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:587904140948 (547.5 GiB) TX bytes:3897298090639 (3.5 TiB)
Comment 5•11 years ago
the core to aggregation interface for the puppet server looks good
dcurado@core1.releng.scl3.mozilla.net> show ethernet-switching table | match 00:50:56:BB:26:9E
releng-srv 00:50:56:bb:26:9e Learn 0 ae18.0
dcurado@core1.releng.scl3.mozilla.net> show interfaces ae18 statistics detail
Physical interface: ae18, Enabled, Physical link is Up
Interface index: 146, SNMP ifIndex: 623, Generation: 149
Description: to_switch1.r601-1
Link-level type: Ethernet, MTU: 1514, Speed: 20Gbps, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled,
Source filtering: Disabled, Flow control: Disabled, Minimum links needed: 1, Minimum bandwidth needed: 0
Device flags : Present Running
Interface flags: SNMP-Traps Internal: 0x0
Current address: 78:fe:3d:48:ba:55, Hardware address: 78:fe:3d:48:ba:55
Last flapped : 2014-04-05 14:29:53 UTC (8w5d 03:54 ago)
Statistics last cleared: Never
Traffic statistics:
Input bytes : 10590287096990 7685800 bps
Output bytes : 15107354513983 34374880 bps
Input packets: 13971749732 3309 pps
Output packets: 15861401292 4957 pps
IPv6 transit statistics:
Input bytes : 0
Output bytes : 0
Input packets: 0
Output packets: 0
Input errors:
Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Giants: 0, Policed discards: 0, Resource errors: 0
Output errors:
Carrier transitions: 2, Errors: 0, Drops: 0, MTU errors: 0, Resource errors: 0
Comment 6•11 years ago
The aggregation switch port to the puppet2 machine looks OK, but does have 62 carrier transitions.
We don't have a record of when the last one was, and the switch has been up for 28 weeks, so...
I did see that this switch was configured 2 days ago, so I'll ask Arzhel about this.
Arzhel:
dcurado@switch1.r601-1.ops.scl3.mozilla.net# show | compare
[edit groups esx-trunk interfaces <*> unit 0 family ethernet-switching vlan]
- members [ private vmotion db releng-dmz releng-private dmz community app-amo app-bugs app-dist app-generic cg-bugs ops qa sandbox seamicro shared web releng-winbuild releng-wintry releng-srv releng-build releng-mobile releng-try vips ad-db metrics stage-bugs stage-metrics anycast sec paas dev-paas releng-wintest vm-labs releng-relabs cloud_admin cloud_core cloud_internal cloud_loaner cloud_try ];
+ members [ private vmotion db releng-dmz releng-private dmz community app-amo app-bugs app-dist app-generic cg-bugs ops qa sandbox seamicro shared web releng-winbuild releng-wintry releng-srv releng-build releng-mobile releng-try vips ad-db metrics stage-bugs stage-metrics anycast sec paas dev-paas releng-wintest vm-labs releng-relabs ];
[edit vlans]
- cloud_admin {
- vlan-id 2102;
- }
- cloud_core {
- vlan-id 296;
- }
- cloud_internal {
- vlan-id 2105;
- }
- cloud_loaner {
- vlan-id 2100;
- }
- cloud_try {
- vlan-id 288;
- }
I can't imagine that removing these vlans would make a performance difference.
But I am curious -- why did we have to remove them?
Flags: needinfo?(arzhel)
Comment 8•11 years ago
Well, OK then, we did add some vlans to one of these switches 2 days ago, but I think that would
have caused slowdowns for a lot of people if that were the cause.
In the meantime, I've traced the connections between these two hosts:
10.26.48.50/32 00:50:56:bb:26:9e vlan 248 (srv) connects switch1.r601-1.ops.scl3 via ae3
switch1.r601-1 connects to core1.releng.scl3 via ae18
core1.releng.scl3 connects to switch1.r401-2.ops.releng via ae11
switch1.r401-2.ops.releng connects via ge-0/0/3 to
10.26.40.153 00:25:90:c5:41:08 vlan 240 (wintest)
I could not find any "smoking gun" on any interface in that path.
There are some carrier transitions, but nothing flapping up and down.
Comment 9•11 years ago
I pasted in the wrong ping output in comment 3; here is the correct one:
[dcurado@releng-puppet2.srv.releng.scl3.mozilla.com ~]$ ping 10.26.40.153
PING 10.26.40.153 (10.26.40.153) 56(84) bytes of data.
64 bytes from 10.26.40.153: icmp_seq=1 ttl=127 time=3.63 ms
64 bytes from 10.26.40.153: icmp_seq=2 ttl=127 time=1.06 ms
64 bytes from 10.26.40.153: icmp_seq=3 ttl=127 time=1.09 ms
64 bytes from 10.26.40.153: icmp_seq=4 ttl=127 time=1.03 ms
64 bytes from 10.26.40.153: icmp_seq=5 ttl=127 time=0.904 ms
64 bytes from 10.26.40.153: icmp_seq=6 ttl=127 time=0.862 ms
64 bytes from 10.26.40.153: icmp_seq=7 ttl=127 time=0.898 ms
64 bytes from 10.26.40.153: icmp_seq=8 ttl=127 time=0.899 ms
64 bytes from 10.26.40.153: icmp_seq=9 ttl=127 time=1.00 ms
64 bytes from 10.26.40.153: icmp_seq=10 ttl=127 time=0.924 ms
64 bytes from 10.26.40.153: icmp_seq=11 ttl=127 time=0.882 ms
64 bytes from 10.26.40.153: icmp_seq=12 ttl=127 time=0.900 ms
64 bytes from 10.26.40.153: icmp_seq=13 ttl=127 time=1.11 ms
64 bytes from 10.26.40.153: icmp_seq=14 ttl=127 time=0.961 ms
64 bytes from 10.26.40.153: icmp_seq=15 ttl=127 time=1.09 ms
64 bytes from 10.26.40.153: icmp_seq=16 ttl=127 time=0.846 ms
64 bytes from 10.26.40.153: icmp_seq=17 ttl=127 time=1.05 ms
and a flood ping looks perfect...
[dcurado@releng-puppet2.srv.releng.scl3.mozilla.com ~]$ sudo ping -f -c 1000 10.26.40.153
PING 10.26.40.153 (10.26.40.153) 56(84) bytes of data.
--- 10.26.40.153 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 949ms
rtt min/avg/max/mdev = 0.729/0.931/3.428/0.165 ms, ipg/ewma 0.950/0.937 ms
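The ipg/ewma figure at the end of the flood-ping summary is an inter-packet gap plus an exponentially weighted moving average of the RTT. A minimal sketch of that kind of smoothing (the weight here is an assumed value, not necessarily the one iputils uses):

```python
def ewma(samples, weight=0.125):
    # Exponentially weighted moving average: recent samples count more,
    # so a burst of slow replies shows up quickly in the smoothed value.
    avg = samples[0]
    for t in samples[1:]:
        avg = (1 - weight) * avg + weight * t
    return avg
```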
Comment 10•11 years ago
Q said he was running some jobs that suddenly sped up last night.
Jake noticed a drastic slowdown in tftp today.
I've needinfoed them to see if they can give you more details (times, src and dest info, etc).
Flags: needinfo?(q)
Flags: needinfo?(jwatkins)
Comment 11•11 years ago
I noticed a hangup of the initrd download while kickstarting keystone1.admin.cloud.releng.scl3. It took several minutes to complete the download, which I was sure had completely hung. I didn't see any timeouts though; the dot progress bar just came to a stop, then suddenly started again and finished.
keystone1.admin.cloud.releng.scl3 -> admin1.private.releng.scl3.mozilla.com udp/tftp
Flags: needinfo?(jwatkins)
Comment 12•11 years ago
Thank you, Jake, for providing some specific information; that is very helpful.
In fact, it allowed me to replicate the problem you are seeing.
Pinging admin1.private.releng.scl3.mozilla.com from keystone1.admin.cloud.releng.scl3.mozilla.com:
[dcurado@keystone1 ~]$ ping 10.26.75.5
PING 10.26.75.5 (10.26.75.5) 56(84) bytes of data.
64 bytes from 10.26.75.5: icmp_seq=1 ttl=63 time=313 ms
64 bytes from 10.26.75.5: icmp_seq=2 ttl=63 time=294 ms
64 bytes from 10.26.75.5: icmp_seq=3 ttl=63 time=1.14 ms
64 bytes from 10.26.75.5: icmp_seq=4 ttl=63 time=330 ms
64 bytes from 10.26.75.5: icmp_seq=5 ttl=63 time=302 ms
64 bytes from 10.26.75.5: icmp_seq=6 ttl=63 time=306 ms
64 bytes from 10.26.75.5: icmp_seq=7 ttl=63 time=313 ms
64 bytes from 10.26.75.5: icmp_seq=8 ttl=63 time=323 ms
64 bytes from 10.26.75.5: icmp_seq=9 ttl=63 time=280 ms
64 bytes from 10.26.75.5: icmp_seq=10 ttl=63 time=1.02 ms
64 bytes from 10.26.75.5: icmp_seq=11 ttl=63 time=246 ms
64 bytes from 10.26.75.5: icmp_seq=12 ttl=63 time=1.71 ms
64 bytes from 10.26.75.5: icmp_seq=13 ttl=63 time=233 ms
Clearly there is a problem.
Let me dig and see what I can figure out.
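To quantify the bimodal pattern above (clusters near 1 ms and near 300 ms), the per-packet times can be pulled out of raw ping output. A small sketch — `parse_ping_times` and `split_by_threshold` are hypothetical helpers, not part of any tooling in this bug:

```python
import re

def parse_ping_times(output):
    # Extract the time= values (in ms) from ping's per-packet lines.
    return [float(m.group(1)) for m in re.finditer(r"time=([0-9.]+) ms", output)]

def split_by_threshold(times, threshold_ms=10.0):
    # Separate "fast" from "slow" replies to see how often the path stalls.
    fast = [t for t in times if t < threshold_ms]
    slow = [t for t in times if t >= threshold_ms]
    return fast, slow
```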
Comment 13•11 years ago
While I don't think the following is the problem, it's something that should be corrected:
[dcurado@keystone1 ~]$ ifconfig -a
eth0 Link encap:Ethernet HWaddr 00:50:56:bb:a5:f8
inet addr:10.26.103.0 Bcast:10.26.103.255 Mask:255.255.254.0
inet6 addr: fe80::250:56ff:febb:a5f8/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:597000 errors:0 dropped:0 overruns:0 frame:0
TX packets:69294 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:859320766 (859.3 MB) TX bytes:14169135 (14.1 MB)
The host part of the IP being .0 is OK in this case (with a /23 mask the network address is
10.26.102.0, so 10.26.103.0 is a valid host address), but it might be worth changing it to
something else, just for grins, to see if that has anything to do with this problem.
I doubt it, but what the heck.
Comment 14•11 years ago
pinging from admin1.scl3.mozilla.com to admin1.releng.scl3.mozilla.com looks fine:
[dcurado@admin1a.private.scl3 ~]$ ping admin1.private.releng.scl3.mozilla.com
PING admin1.private.releng.scl3.mozilla.com (10.26.75.5) 56(84) bytes of data.
64 bytes from admin1.private.releng.scl3.mozilla.com (10.26.75.5): icmp_seq=1 ttl=61 time=1.56 ms
64 bytes from admin1.private.releng.scl3.mozilla.com (10.26.75.5): icmp_seq=2 ttl=61 time=1.52 ms
64 bytes from admin1.private.releng.scl3.mozilla.com (10.26.75.5): icmp_seq=3 ttl=61 time=1.54 ms
64 bytes from admin1.private.releng.scl3.mozilla.com (10.26.75.5): icmp_seq=4 ttl=61 time=1.85 ms
64 bytes from admin1.private.releng.scl3.mozilla.com (10.26.75.5): icmp_seq=5 ttl=61 time=1.45 ms
64 bytes from admin1.private.releng.scl3.mozilla.com (10.26.75.5): icmp_seq=6 ttl=61 time=2.11 ms
But pinging from admin1.scl3.mozilla.net to keystone1.admin.cloud.releng.scl3.mozilla.com
doesn't look as good...
[dcurado@admin1a.private.scl3 ~]$ ping keystone1.admin.cloud.releng.scl3.mozilla.com
PING keystone1.admin.cloud.releng.scl3.mozilla.com (10.26.103.0) 56(84) bytes of data.
64 bytes from keystone1.admin.cloud.releng.scl3.mozilla.com (10.26.103.0): icmp_seq=1 ttl=61 time=27.3 ms
64 bytes from keystone1.admin.cloud.releng.scl3.mozilla.com (10.26.103.0): icmp_seq=2 ttl=61 time=1.60 ms
64 bytes from keystone1.admin.cloud.releng.scl3.mozilla.com (10.26.103.0): icmp_seq=3 ttl=61 time=48.3 ms
64 bytes from keystone1.admin.cloud.releng.scl3.mozilla.com (10.26.103.0): icmp_seq=4 ttl=61 time=50.1 ms
64 bytes from keystone1.admin.cloud.releng.scl3.mozilla.com (10.26.103.0): icmp_seq=5 ttl=61 time=41.2 ms
64 bytes from keystone1.admin.cloud.releng.scl3.mozilla.com (10.26.103.0): icmp_seq=6 ttl=61 time=1.65 ms
64 bytes from keystone1.admin.cloud.releng.scl3.mozilla.com (10.26.103.0): icmp_seq=7 ttl=61 time=15.8 ms
64 bytes from keystone1.admin.cloud.releng.scl3.mozilla.com (10.26.103.0): icmp_seq=8 ttl=61 time=17.7 ms
Comment 15•11 years ago
ping from admin1.scl3.mozilla.com to the firewall port that connects the cloud_admin vlan looks
fine.
[dcurado@admin1a.private.scl3 ~]$ ping 10.26.102.1
PING 10.26.102.1 (10.26.102.1) 56(84) bytes of data.
64 bytes from 10.26.102.1: icmp_seq=1 ttl=62 time=1.54 ms
64 bytes from 10.26.102.1: icmp_seq=2 ttl=62 time=1.50 ms
64 bytes from 10.26.102.1: icmp_seq=3 ttl=62 time=1.62 ms
64 bytes from 10.26.102.1: icmp_seq=4 ttl=62 time=1.63 ms
64 bytes from 10.26.102.1: icmp_seq=5 ttl=62 time=1.61 ms
64 bytes from 10.26.102.1: icmp_seq=6 ttl=62 time=1.51 ms
64 bytes from 10.26.102.1: icmp_seq=7 ttl=62 time=1.55 ms
64 bytes from 10.26.102.1: icmp_seq=8 ttl=62 time=1.79 ms
64 bytes from 10.26.102.1: icmp_seq=9 ttl=62 time=1.55 ms
64 bytes from 10.26.102.1: icmp_seq=10 ttl=62 time=1.54 ms
The ping times could be better, but firewalls (and routers in general)
aren't great at generating packets to do things like respond to pings.
All looks good.
Comment 16•11 years ago
Here are the other hosts I have found on this vlan:
dcurado@fw1.releng.scl3.mozilla.net> show arp no-resolve | match 10.26.103
00:50:56:bb:a5:f8 10.26.103.0 reth0.2102 none
00:50:56:bb:0d:2a 10.26.103.1 reth0.2102 none
00:50:56:bb:41:67 10.26.103.2 reth0.2102 none
00:50:56:bb:e5:db 10.26.103.3 reth0.2102 none
00:50:56:bb:c2:b1 10.26.103.4 reth0.2102 none
00:50:56:bb:d5:6f 10.26.103.5 reth0.2102 none
00:50:56:bb:b1:37 10.26.103.6 reth0.2102 none
00:50:56:bb:93:2e 10.26.103.99 reth0.2102 none
Pinging each of them gives mixed results. Sometimes the ping runs clean,
with echo times of ~1.5 ms for round trip time. Sometimes the ping has
some values between 10 and 20ms -- not a disaster, but not what one would
expect either.
Jake -- are these hosts actual servers, VMs, or ??
Thanks - Dave
Flags: needinfo?(jwatkins)
Comment 17•11 years ago
(In reply to Dave Curado :dcurado from comment #16)
> 00:50:56:bb:a5:f8 10.26.103.0 reth0.2102 none
> 00:50:56:bb:0d:2a 10.26.103.1 reth0.2102 none
> 00:50:56:bb:41:67 10.26.103.2 reth0.2102 none
> 00:50:56:bb:e5:db 10.26.103.3 reth0.2102 none
> 00:50:56:bb:c2:b1 10.26.103.4 reth0.2102 none
> 00:50:56:bb:d5:6f 10.26.103.5 reth0.2102 none
> 00:50:56:bb:b1:37 10.26.103.6 reth0.2102 none
> 00:50:56:bb:93:2e 10.26.103.99 reth0.2102 none
> Jake -- are these hosts actual servers, VMs, or ??
Dave, all of those are VMs on the scl3 ESX cluster, including .103.99, which is a second vif on the openstack-testing VM. It doesn't have an inventory entry since it is only temporary.
https://inventory.mozilla.org/en-US/core/network/565/
Flags: needinfo?(jwatkins)
Comment 18•11 years ago
I couldn't get 10.26.103.99 to answer a ping, so <shrug> not sure what's up with that one.
I'm going to look at and document each port in the entire path.
from core1.releng.scl3.mozilla.net, the aggregation switch for this connection is on AE18, which
is comprised of xe-0/0/34 and xe-1/0/34.
The interface statistics on both those interfaces on the core1.releng.scl3 side look good:
xe-0/0/34
Input errors:
Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0, L2 mismatch timeouts: 0,
FIFO errors: 0, Resource errors: 0
Output errors:
Carrier transitions: 1, Errors: 0, Drops: 0, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 0,
Resource errors: 0
xe-1/0/34
Input errors:
Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0, L2 mismatch timeouts: 0,
FIFO errors: 0, Resource errors: 0
Output errors:
Carrier transitions: 1, Errors: 0, Drops: 0, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 0,
Resource errors: 0
Comment 19•11 years ago
On the aggregation switch side, switch1.r601-1.ops.scl3.mozilla.net connects to core1.releng.scl3 via
AE5, which is comprised of xe-0/0/10 and xe-1/0/10. Both those interfaces also look clean.
xe-0/0/10:
Input errors:
Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0, L2 mismatch timeouts: 0,
FIFO errors: 0, Resource errors: 0
Output errors:
Carrier transitions: 17, Errors: 0, Drops: 0, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 0,
Resource errors: 0
xe-1/0/10:
Input errors:
Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0, L2 mismatch timeouts: 0,
FIFO errors: 0, Resource errors: 0
Output errors:
Carrier transitions: 5, Errors: 0, Drops: 0, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 0,
Resource errors: 0
Comment 20•11 years ago
cc :gcox since we're seeing inter-VM network delays.
If you are pinging .103.99 from anywhere other than its L2 domain, it will fail since there are no gateways assigned to that vif on that VM. We can probably remove the vif anyway. It is no longer needed.
Comment 21•11 years ago
The ESX cluster is connected to the aggregation switch via AE3, which is comprised of xe-0/0/1 and
xe-1/0/1. These interfaces look clean.
xe-0/0/1:
Input errors:
Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0, L2 mismatch timeouts: 0,
FIFO errors: 0, Resource errors: 0
Output errors:
Carrier transitions: 51, Errors: 0, Drops: 0, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 0,
Resource errors: 0
xe-1/0/1:
Input errors:
Errors: 0, Drops: 0, Framing errors: 0, Runts: 0, Policed discards: 0, L3 incompletes: 0, L2 channel errors: 0, L2 mismatch timeouts: 0,
FIFO errors: 0, Resource errors: 0
Output errors:
Carrier transitions: 11, Errors: 0, Drops: 0, Collisions: 0, Aged packets: 0, FIFO errors: 0, HS link CRC errors: 0, MTU errors: 0,
Resource errors: 0
Jake -- are there other VMs on that ESX cluster besides your VMs for this openstack stage project?
Thanks.
Comment 22•11 years ago
> Jake -- are there other VMs on that ESX cluster besides your VMs for this
> openstack stage project?
> Thanks.
Yes. There are many other VMs. :gcox can provide more info on the scl3 vmware cluster.
Comment 23•11 years ago
Just catching up here. Correlation: 'Tuesday' in comment 0 lines up with https://bugzilla.mozilla.org/show_bug.cgi?id=1019201#c1, but that doesn't feel like causation, because at that point the upstream links from 601-1 to relengcore hadn't changed; the 'cloud' vlans weren't brought in until Friday, so ESX was an island as far as 'cloud' was concerned, and unchanged as far as 'releng' is concerned.
This feels like we've derailed from the original request. I've seen flood-pings all Just Work from the IPs I've looked at in reading so far, including comment 0.. so I guess I'm not sure what you need from me, or what problem we're still seeing.
Comment 24•11 years ago
Greg -- I think you have given me/us the information I was looking for.
That is to say: you are able to flood ping out from other VMs on this ESX cluster,
and you don't see a performance problem.
I have been able to reproduce some performance problems (see comment #12)
but I am not seeing anything obvious with the switch connections.
That has got me wondering about the ESX cluster, and that's why I wanted to
know about the other VMs.
Can I ask a favor? Can you try pinging admin1.releng.scl3.mozilla.com from a couple of
your VMs on that ESX cluster and let me know how it looks?
(or, let me know the host names or IPs and I can try logging in)
Thanks!
Flags: needinfo?(gcox)
Comment 25•11 years ago
Sample set of 3: one from :dividehex's cloud VLAN, one from comment 0's vlan, and one outside of releng-land. A MAC address starting with 00:50:56 signifies "it's a VM."
$ date ; for i in 10.26.103.2 10.26.48.50 10.22.75.132 ; do echo $i ; ssh root@$i "/sbin/ifconfig eth0 | head -2 ; ping -f -c 10000 10.26.75.6" ; done ; date
Tue Jun 10 14:51:41 EDT 2014
10.26.103.2
eth0 Link encap:Ethernet HWaddr 00:50:56:bb:41:67
inet addr:10.26.103.2 Bcast:10.26.103.255 Mask:255.255.254.0
PING 10.26.75.6 (10.26.75.6) 56(84) bytes of data.
--- 10.26.75.6 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 8885ms
rtt min/avg/max/mdev = 0.622/0.862/9.899/0.339 ms, ipg/ewma 0.888/1.025 ms
10.26.48.50
eth0 Link encap:Ethernet HWaddr 00:50:56:BB:26:9E
inet addr:10.26.48.50 Bcast:10.26.51.255 Mask:255.255.252.0
PING 10.26.75.6 (10.26.75.6) 56(84) bytes of data.
--- 10.26.75.6 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 10264ms
rtt min/avg/max/mdev = 0.601/1.002/6.239/0.540 ms, ipg/ewma 1.026/0.910 ms
10.22.75.132
eth0 Link encap:Ethernet HWaddr 00:50:56:BB:47:D8
inet addr:10.22.75.132 Bcast:10.22.75.255 Mask:255.255.255.0
PING 10.26.75.6 (10.26.75.6) 56(84) bytes of data.
--- 10.26.75.6 ping statistics ---
10000 packets transmitted, 10000 received, 0% packet loss, time 15013ms
rtt min/avg/max/mdev = 1.220/1.471/6.042/0.217 ms, ipg/ewma 1.501/1.446 ms
Tue Jun 10 14:52:20 EDT 2014
Flags: needinfo?(gcox)
Comment 26•11 years ago
Thank you Greg.
The avg times on those flood pings look good to me.
Jake -- I'm not coming up with any smoking guns here.
Can you confirm that the problem you were seeing is still occurring?
Thanks.
Flags: needinfo?(jwatkins)
Comment 27•11 years ago
I have received no further information on this issue.
Please re-open the bug and provide further details if this problem still exists.
E.g., if you are seeing a slowdown in a file transfer from one host to another, then please:
- ping the dst host from the src host, let it run for 10 seconds or so, then
put that output in the bug.
- if you have mtr installed, run mtr from the src host to the dst host,
and put that output in the bug.
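As a sketch, the requested capture could be scripted; the exact mtr flags used here (`--report`, `--report-cycles`) are an assumption about the installed mtr version:

```python
def diagnostic_commands(dst, count=10):
    # Build the two captures requested above: a ~10-second ping and,
    # if mtr is installed, an mtr report, both toward the dst host.
    return [
        ["ping", "-c", str(count), dst],
        ["mtr", "--report", "--report-cycles", str(count), dst],
    ]
```

Run each command list from the src host (e.g. with `subprocess.run(cmd, capture_output=True, text=True)`) and paste the output into the bug.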
Thanks,
Dave
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → INCOMPLETE
Comment 28•11 years ago
Sorry, I meant to reply to this before I left on Friday and never had the chance. I don't think we've seen the issue for a while (it always mysteriously clears up after an hour or a day or two). The VMware host being the issue is an interesting hypothesis, though. All of the things we normally do where we (relops) notice the slowness would, I think, be traffic to or from a VM (puppet, domain controllers, wds, etc). I'm not sure if we also see the issue during OS X installs, which would point at a different culprit (since that's very different hardware). When we notice this issue again, the VMware machines might be a good place to start investigating.
Flags: needinfo?(q)
Flags: needinfo?(jwatkins)
Updated•3 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard