Closed
Bug 808129
Opened 13 years ago
Closed 12 years ago
data corruption on github-sync1-dev.dmz.scl3, odd messages in /var/log/messages
Categories
(Developer Services :: General, task)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: hwine, Assigned: afernandez)
References
Details
(Whiteboard: [reit-ops])
While investigating a repository issue for bug 807702 on github-sync1-dev.dmz.scl3, it appears that data on the box became corrupted somewhere between 1030 PT & 1100 PT on Oct 31, 2012.
Two questions:
1) In that time range there are a lot of DHCP messages in /var/log/messages, and rsyslogd also reports dropping messages. Does this point to anything that could have caused the corruption? (An 'hg update' operation would have been running during this time.)
2) Were there any network or data center events in this time period that could have caused corruption?
I am in the process of saving off the corrupted repository; I'll update this bug with more information once I have restored services.
Updated•13 years ago
Assignee: server-ops-devservices → network-operations
Component: Server Operations: Developer Services → Server Operations: Netops
QA Contact: shyam → ravi
The damaged repository has been moved to:
/mnt/netapp/github_sync/converted_repos/bug807702/releases-mozilla-central-bad-hg
and still shows the corruption:
[vcs2vcs@github-sync1-dev.dmz.scl3 releases-mozilla-central-bad-hg]$ hg verify
checking changesets
checking manifests
crosschecking files in changesets and manifests
checking files
configure.in@111932: 5b70b88f6ea1 in manifests not found
mobile/android/base/DoorHangerPopup.java@111754: 62e6137d125c in manifests not found
109125 files, 112088 changesets, 631673 total revisions
2 integrity errors encountered!
(first damaged changeset appears to be 111754)
I believe this is locally caused corruption due to the following
sequence in the vcs2vcs log files:
2012-10-30T10:50:09-0700: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull
pulling from http://hg.mozilla.org/mozilla-central
searching for changes
no changes found
2012-10-30T10:50:09-0700;0:00.63;0.63;0.05;0.23;258272;11;9;0;0;0;0;0;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull;
...
2012-10-30T10:50:16-0700: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport
2012-10-30T10:50:16-0700;0:09.84;9.84;0.35;9.47;2611536;23;0;0;118704;0;0;0;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport;
2012-10-30T10:50:26-0700: starting: git --git-dir /opt/vcs2vcs/repos/releases-mozilla-central/.git push --force --mirror git.m.o
2012-10-30T10:50:26-0700;0:00.43;0.43;0.03;0.00;14304;3;167;0;0;0;0;0;git --git-dir /opt/vcs2vcs/repos/releases-mozilla-central/.git push --force --mirror git.m.o;
...
2012-10-30T11:00:32-0700: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull
pulling from http://hg.mozilla.org/mozilla-central
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files
2012-10-30T11:00:32-0700;0:02.05;2.05;0.09;1.01;411520;15;11;0;8368;0;0;0;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull;
...
2012-10-30T11:00:42-0700: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport
exporting hg objects to git
abort: data/mobile/android/base/DoorHangerPopup.java.i@62e6137d125c: no match found!
2012-10-30T11:00:42-0700;0:07.55;7.55;0.24;7.29;1416864;21;2;0;117248;0;0;255;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport;
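For context, the commands in that log amount to a pull → gexport → push cycle. A minimal sketch of one cycle, reconstructed from the log above (only the repo path and the git.m.o remote name are from the log; the wrapper itself is illustrative, not the actual vcs2vcs driver):
REPO=/opt/vcs2vcs/repos/releases-mozilla-central
hg --cwd "$REPO" pull                                      # fetch new changesets from hg.mozilla.org
hg --cwd "$REPO" gexport                                   # hg-git: convert new hg changesets to git objects
git --git-dir "$REPO/.git" push --force --mirror git.m.o   # mirror the converted repo downstream
The abort happens in the gexport step: the manifest references a file revision (62e6137d125c of DoorHangerPopup.java) that is missing from the local filelog, which is the same inconsistency the hg verify output above flags.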
(In reply to Hal Wine [:hwine] from comment #0)
> while investigating a repository issue for bug 807702 on
> github-sync1-dev.dmz.scl3, it appears that data on the box became corrupted
> somewhere between 1030 PT & 1100 PT on Oct 31, 2012.
CORRECTION: Date of event is Oct 30 - times are correct.
Found another corruption, same host, different day - see bug 807630
Catlee also pointed out this is very similar to bug 718186, which points to a back end issue.
Comment 4•13 years ago
My keys don't appear to be on the host, so I can't really dive very deep. The host appears to be on a SeaMicro, however, and the ae (aggregated Ethernet) interfaces show no errors.
I'm not sure there is much more I can do at this point. It seems rather unlikely that data corruption at this level would be caused by the network.
What is the host pair here? I see hg is involved, but I'm not sure if the storage issues we were having could be playing a role here.
Punting back to dev services: no network issue (comment #4), and the issue appeared on another box (comment #3). That makes it look more like our friend bug 718186.
Assignee: network-operations → server-ops-devservices
Component: Server Operations: Netops → Server Operations: Developer Services
QA Contact: ravi → shyam
Another occurrence - from app logs:
2012-11-13T15:58:07-0800: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull
2012-11-13T15:58:07-0800;0:03.20;3.20;0.13;1.60;453744;12;15;32;21488;0;0;0;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull;
2012-11-13T15:58:17-0800: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport
exporting hg objects to git
abort: data/build/unix/build-clang/build-clang.py.i@210595260086: no match found!
2012-11-13T15:58:17-0800;0:18.94;18.94;1.61;17.28;1474576;32;2;0;118768;0;0;255;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport
From /var/log/messages around that time:
Nov 13 15:58:13 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:58:26 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:58:39 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:58:47 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:59:07 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:59:17 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:59:26 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:59:38 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:59:56 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 16:00:02 github-sync1-dev puppet-agent[5961]: Sleeping for 1225 seconds (splay is enabled)
Nov 13 16:00:02 github-sync1-dev kernel: audit: audit_lost=4132809 audit_rate_limit=400 audit_backlog_limit=320
Nov 13 16:00:03 github-sync1-dev kernel: audit: audit_lost=4134669 audit_rate_limit=400 audit_backlog_limit=320
Nov 13 16:00:04 github-sync1-dev rsyslogd-2177: imuxsock begins to drop messages from pid 10702 due to rate-limiting
Nov 13 16:00:04 github-sync1-dev kernel: audit: audit_lost=4136489 audit_rate_limit=400 audit_backlog_limit=320
Nov 13 16:00:05 github-sync1-dev kernel: audit: audit_lost=4138318 audit_rate_limit=400 audit_backlog_limit=320
Nov 13 16:00:11 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d
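(Aside: the rsyslogd-2177 lines mean messages were dropped, so the log around these events is incomplete. If that matters for future debugging, the imuxsock rate limiting can be relaxed in /etc/rsyslog.conf - a minimal sketch, assuming the rsyslog 5.x shipped on RHEL6; the values are illustrative:)
# relax imuxsock rate limiting so bursts (e.g. the audit flood above) are not dropped
$ModLoad imuxsock
$SystemLogRateLimitInterval 2      # window in seconds; 0 disables rate limiting entirely
$SystemLogRateLimitBurst 20000     # messages allowed per window (default is 200)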
Bumping up priority - B2G and a third occurrence - we're about to add more load to these boxes.
Severity: normal → major
Whiteboard: [reit-ops]
Updated•13 years ago
Assignee: server-ops-devservices → shyam
Comment 8•13 years ago
Hal,
I'm not sure what you're trying to tell me here. The bug you linked to is related to Varnish, which we don't have in the system anymore. If there were issues with hg, wouldn't all the builders hit it, not just these machines?
Shyam - I'm taking a shot in the dark, as _something_ bad is going on. These boxes are not in the build-vpn, so they're just hitting the normal hg.m.o webheads. The builders wouldn't be affected, right?
If Ravi says all those DHCP messages, dropped messages, and lost audits aren't a problem, and I've seen this on two boxes, then something in the path to the webheads seems most likely.
I don't know the details of the varnish issue - catlee thought the functional impact (from our viewpoint) was similar enough to make the connection. Is there anything in the current setup that could cause a similar issue?
My goal is to get some action going, so the problem can be rectified before it goes through too many more cycles. (These outages page me.)
Comment 10•13 years ago
Seems like the interface was dropping packets:
[shyam@github-sync1-dev.dmz.scl3 ~]$ ifconfig
bond0 Link encap:Ethernet HWaddr 00:22:99:36:86:22
inet addr:10.22.74.67 Bcast:10.22.74.255 Mask:255.255.255.0
inet6 addr: fe80::222:99ff:fe36:8622/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:8230632703 errors:0 dropped:12289921 overruns:0 frame:0
TX packets:9158162303 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2015615325108 (1.8 TiB) TX bytes:3609669325864 (3.2 TiB)
eth0 Link encap:Ethernet HWaddr 00:22:99:36:86:22
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:8219237959 errors:0 dropped:12289921 overruns:0 frame:0
TX packets:9154779397 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2013917926214 (1.8 TiB) TX bytes:3609077317536 (3.2 TiB)
I don't see it increasing, so these could be past errors (previous network issues in SCL3, etc.).
We could stop all your stuff, reboot the box, and watch whether it happens again?
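(One way to check whether the drop counter is truly static or still creeping up - a minimal sketch that polls the sysfs counter, not something that was actually run here:)
# watch the bond0 rx drop counter over time
while true; do
    printf '%s rx_dropped=%s\n' "$(date)" "$(cat /sys/class/net/bond0/statistics/rx_dropped)"
    sleep 60
done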
Comment 11•13 years ago (Reporter)
So, the hypothesis is that something didn't fully recover after the various network events. I can buy that - happens to my laptop quite frequently.
Okay - we'd need to coordinate on the reboot, so I can stop/restart services cleanly. I'll ping in #infra to set up a time.
Comment 12•13 years ago
(In reply to Hal Wine [:hwine] from comment #11)
> So, the hypothesis is that something didn't fully recover after the various
> network events. I can buy that - happens to my laptop quite frequently.
Correct.
> Okay - we'd need to coordinate on the reboot, so I can stop/restart services
> cleanly. I'll ping in #infra to set up a time.
I'm going to be AFK for the next few days; you have sudo, so you can just stop everything and run a sudo reboot? :D
Let me know if you're not okay with that approach and we'll work something out. Thanks!
Comment 13•13 years ago (Assignee)
Node was rebooted by hwine and it's back up.
Comment 14•13 years ago (Reporter)
All vcs2vcs processes restarted.
For future reference: 244 dropped packets and a couple of odd messages after the reboot.
From /var/log/messages:
Nov 14 19:20:01 github-sync1-dev rsyslogd-2177: imuxsock begins to drop messages from pid 2112 due to rate-limiting
Nov 14 19:20:02 github-sync1-dev rsyslogd-2177: imuxsock lost 13 messages from pid 2112 due to rate-limiting
From ifconfig:
bond0 Link encap:Ethernet HWaddr 00:22:99:36:86:22
inet addr:10.22.74.67 Bcast:10.22.74.255 Mask:255.255.255.0
inet6 addr: fe80::222:99ff:fe36:8622/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:2144294 errors:0 dropped:244 overruns:0 frame:0
TX packets:2082069 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:746614036 (712.0 MiB) TX bytes:429273843 (409.3 MiB)
...
eth6 Link encap:Ethernet HWaddr 00:22:99:36:86:22
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:2135576 errors:0 dropped:244 overruns:0 frame:0
TX packets:89 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:745541161 (711.0 MiB) TX bytes:13459 (13.1 KiB)
Updated•13 years ago
Component: Server Operations: Developer Services → Server Operations: Netops
QA Contact: shyam → ravi
Comment 15•13 years ago
Netops, can we find out why this host's interface keeps losing packets?
Comment 16•13 years ago
Meh,
I'll do some more checking here; it seems like we have slightly different configs across hosts.
Component: Server Operations: Netops → Server Operations: Developer Services
QA Contact: ravi → shyam
Comment 17•13 years ago
Going to pull in some external help here.
Aj, can you check whether the hosts have different network configs (they seem like they do)? I guess we should get them onto the correct SeaMicro configs. We might need to sync with Hal before making any changes, though.
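(A rough sketch of what that comparison could look like - the second host name and the file list are assumptions, not from this bug:)
# compare network-related config between two sync hosts (github-sync2-dev is hypothetical)
for f in /etc/modprobe.d/local.conf /etc/sysconfig/network-scripts/ifcfg-bond0; do
    if diff <(ssh github-sync1-dev.dmz.scl3.mozilla.com cat "$f") \
            <(ssh github-sync2-dev.dmz.scl3.mozilla.com cat "$f") > /dev/null; then
        echo "$f: identical"
    else
        echo "$f: differs"
    fi
done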
Assignee: shyam → afernandez
Comment 18•13 years ago (Reporter)
Another occurrence:
vcs-sync host: github-sync1-dev.dmz.scl3.mozilla.com
corrupted repo location: /mnt/netapp/github_sync/converted_repos/build-tools/.hg
affected downstream repo: https://github.com/mozilla/build-tools
last good pull/update: 2013-01-10T06:33:45-0800
next good pull/update: 2013-01-10T11:29:02-0800
out of service for: 4:55:17
replacing hg repo resolved issue
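For the record, "replacing hg repo" was along these lines (a minimal sketch; the backup name mirrors the releases-mozilla-central-bad-hg naming above but is illustrative, and the upstream URL is an assumption - only the repo location comes from this bug):
cd /mnt/netapp/github_sync/converted_repos
mv build-tools build-tools-bad-hg                        # set the corrupt copy aside
hg clone http://hg.mozilla.org/build/tools build-tools   # re-seed from the (assumed) upstream
hg --cwd build-tools verify                              # confirm the fresh copy is clean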
Blocks: vcs-sync
Comment 19•13 years ago (Assignee)
Shyam, I'm not seeing any discrepancies in the network settings. The node is using the correct settings:
$ cat /etc/modprobe.d/local.conf
alias bond0 bonding
options bonding miimon=100 mode=4
Currently looking into other possibilities; the NFS mount may be at fault, or at least part of the issue.
Currently bond0 has 1391360 dropped packets; of these, 1391347 were dropped on eth6, 6 on eth5, and 7 on eth4.
A reliable way of reproducing the issue would be best, but I will at least attempt to replicate the packet drops.
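(For reference, the per-slave numbers above can be pulled like this - read-only commands, nothing here changes the configuration:)
cat /proc/net/bonding/bond0          # bonding mode, MII status, per-slave link state
for dev in bond0 eth4 eth5 eth6; do
    echo "$dev rx_dropped=$(cat /sys/class/net/$dev/statistics/rx_dropped)"
done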
Comment 20•12 years ago (Reporter)
No definitive root cause found, but this is no longer worth pursuing.
We have changed the setup so that repositories are no longer kept on NFS, and the trouble hasn't been seen since.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INCOMPLETE
Updated•11 years ago
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services