Closed Bug 808129 Opened 13 years ago Closed 12 years ago

data corruption on github-sync1-dev.dmz.scl3, odd messages in /var/log/messages

Categories: Developer Services :: General (task)
Type: task
Priority: Not set
Severity: major
Tracking: Not tracked
Status: RESOLVED INCOMPLETE
Reporter: hwine
Assignee: afernandez
Whiteboard: [reit-ops]

While investigating a repository issue for bug 807702 on github-sync1-dev.dmz.scl3, it appears that data on the box became corrupted somewhere between 1030 PT & 1100 PT on Oct 31, 2012.

Two questions:
1) In that time range there are a lot of dhcp messages in /var/log/messages, and rsyslogd also reports dropping messages. Does this point to anything that could cause the corruption? (An 'hg update' operation would have been running during this time.)
2) Were there any network/data center events in this time period that could have caused corruption?

I am in the process of saving off the corrupted repository; I'll update this bug with more information once I have restored services.
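For reference, a minimal sketch of the save-off step mentioned above (the destination is the path given in the next comment; the exact commands used are not recorded in this bug):

    # assumption: the repo being saved is the conversion repo named in the later logs
    mkdir -p /mnt/netapp/github_sync/converted_repos/bug807702
    mv /opt/vcs2vcs/repos/releases-mozilla-central \
       /mnt/netapp/github_sync/converted_repos/bug807702/releases-mozilla-central-bad-hg
    # confirm the corruption is still visible in the saved copy
    hg -R /mnt/netapp/github_sync/converted_repos/bug807702/releases-mozilla-central-bad-hg verify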
Assignee: server-ops-devservices → network-operations
Component: Server Operations: Developer Services → Server Operations: Netops
QA Contact: shyam → ravi
The damaged repository has been moved to:
/mnt/netapp/github_sync/converted_repos/bug807702/releases-mozilla-central-bad-hg
and still shows the corruption:

[vcs2vcs@github-sync1-dev.dmz.scl3 releases-mozilla-central-bad-hg]$ hg verify
checking changesets
checking manifests
crosschecking files in changesets and manifests
checking files
 configure.in@111932: 5b70b88f6ea1 in manifests not found
 mobile/android/base/DoorHangerPopup.java@111754: 62e6137d125c in manifests not found
109125 files, 112088 changesets, 631673 total revisions
2 integrity errors encountered!
(first damaged changeset appears to be 111754)

I believe this is locally caused corruption, due to the following sequence in the vcs2vcs log files:

2012-10-30T10:50:09-0700: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull
pulling from http://hg.mozilla.org/mozilla-central
searching for changes
no changes found
2012-10-30T10:50:09-0700;0:00.63;0.63;0.05;0.23;258272;11;9;0;0;0;0;0;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull;
...
2012-10-30T10:50:16-0700: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport
2012-10-30T10:50:16-0700;0:09.84;9.84;0.35;9.47;2611536;23;0;0;118704;0;0;0;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport;
2012-10-30T10:50:26-0700: starting: git --git-dir /opt/vcs2vcs/repos/releases-mozilla-central/.git push --force --mirror git.m.o
2012-10-30T10:50:26-0700;0:00.43;0.43;0.03;0.00;14304;3;167;0;0;0;0;0;git --git-dir /opt/vcs2vcs/repos/releases-mozilla-central/.git push --force --mirror git.m.o;
...
2012-10-30T11:00:32-0700: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull
pulling from http://hg.mozilla.org/mozilla-central
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files
2012-10-30T11:00:32-0700;0:02.05;2.05;0.09;1.01;411520;15;11;0;8368;0;0;0;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull;
...
2012-10-30T11:00:42-0700: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport
exporting hg objects to git
abort: data/mobile/android/base/DoorHangerPopup.java.i@62e6137d125c: no match found!
2012-10-30T11:00:42-0700;0:07.55;7.55;0.24;7.29;1416864;21;2;0;117248;0;0;255;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport;
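For reference, the conversion cycle visible in that log amounts to roughly the following (a minimal sketch; the actual vcs2vcs wrapper scripting is not part of this bug):

    REPO=/opt/vcs2vcs/repos/releases-mozilla-central
    hg --cwd "$REPO" pull                                      # fetch new changesets from hg.mozilla.org
    hg --cwd "$REPO" gexport                                   # hg-git: export hg changesets as git objects
    git --git-dir "$REPO/.git" push --force --mirror git.m.o   # mirror the result to the git remote

The gexport abort at 11:00:42 is the first point in the log where the missing filelog revision (62e6137d125c) surfaces.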
(In reply to Hal Wine [:hwine] from comment #0)
> while investigating a repository issue for bug 807702 on
> github-sync1-dev.dmz.scl3, it appears that data on the box became corrupted
> somewhere between 1030 PT & 1100 PT on Oct 31, 2012.

CORRECTION: Date of event is Oct 30 - times are correct.
Found another corruption, same host, different day - see bug 807630. Catlee also pointed out this is very similar to bug 718186, which points to a back-end issue.
My keys don't appear to be on the host, so I can't dive very deep. The host appears to be on a Seamicro, however, and the ae interfaces show no errors. I'm not sure there is much more I can do at this point; it seems rather unlikely that data corruption at this level would be caused by the network. What is the host pair here? I see hg is involved, but I'm not sure whether the storage issues we were having could be playing a role here.
Punting back into dev services due to no network issue (comment #4) and issue on another box (comment #3). That makes it look more like our friend bug 718186.
Assignee: network-operations → server-ops-devservices
Component: Server Operations: Netops → Server Operations: Developer Services
QA Contact: ravi → shyam
Another occurrence - from app logs:

2012-11-13T15:58:07-0800: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull
2012-11-13T15:58:07-0800;0:03.20;3.20;0.13;1.60;453744;12;15;32;21488;0;0;0;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central pull;
2012-11-13T15:58:17-0800: starting: hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport
exporting hg objects to git
abort: data/build/unix/build-clang/build-clang.py.i@210595260086: no match found!
2012-11-13T15:58:17-0800;0:18.94;18.94;1.61;17.28;1474576;32;2;0;118768;0;0;255;hg --cwd /opt/vcs2vcs/repos/releases-mozilla-central gexport

From /var/log/messages around that time:

Nov 13 15:58:13 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d )
Nov 13 15:58:26 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:58:39 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:58:47 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:59:07 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:59:17 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:59:26 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:59:38 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 15:59:56 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d)
Nov 13 16:00:02 github-sync1-dev puppet-agent[5961]: Sleeping for 1225 seconds (splay is enabled)
Nov 13 16:00:02 github-sync1-dev kernel: audit: audit_lost=4132809 audit_rate_limit=400 audit_backlog_limit=320
Nov 13 16:00:03 github-sync1-dev kernel: audit: audit_lost=4134669 audit_rate_limit=400 audit_backlog_limit=320
Nov 13 16:00:04 github-sync1-dev rsyslogd-2177: imuxsock begins to drop messages from pid 10702 due to rate-limiting
Nov 13 16:00:04 github-sync1-dev kernel: audit: audit_lost=4136489 audit_rate_limit=400 audit_backlog_limit=320
Nov 13 16:00:05 github-sync1-dev kernel: audit: audit_lost=4138318 audit_rate_limit=400 audit_backlog_limit=320
Nov 13 16:00:11 github-sync1-dev dhclient[5563]: DHCPREQUEST on bond0 to 10.22.75.7 port 67 (xid=0x4f05a00d
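(For the record, the syslog context above can be pulled with something like the following - an illustrative sketch; the time-window regex is my own choice, not a command actually run here:)

    # show dhclient / rsyslogd / audit noise in the minutes around the 15:58:17 gexport failure
    grep -E '^Nov 13 (15:5[5-9]|16:0[0-2])' /var/log/messages | \
        grep -E 'dhclient|rsyslogd|audit'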
Bumping up priority - b2g and a third occurrence, and we're about to add more load to these boxes.
Severity: normal → major
Whiteboard: [reit-ops]
Assignee: server-ops-devservices → shyam
Depends on: 811594
Hal, I'm not sure what you're trying to tell me here. The bug you linked to is related to Varnish, which we no longer have in the system. If there were issues with hg, wouldn't all the builders hit it, not just these machines?
Shyam - I'm taking a shot in the dark, as _something_ bad is going on. These boxes are not in the build-vpn, so they're just hitting the normal hg.m.o web-heads; the builders wouldn't be affected, right? If Ravi says all those DHCP messages, dropped messages, and lost audits aren't a problem, and I've seen this on two boxes, then something in the path to the webheads seems most likely.

I don't know the details of the varnish issue - catlee thought it was similar enough in functional impact (from our viewpoint) to make the connection. Is there anything in the current setup that could cause a similar issue?

My goal is to get some action going, so the problem can be rectified before too many more cycles of it. (These outages page me.)
Seems like the interface was dropping packets:

[shyam@github-sync1-dev.dmz.scl3 ~]$ ifconfig
bond0     Link encap:Ethernet  HWaddr 00:22:99:36:86:22
          inet addr:10.22.74.67  Bcast:10.22.74.255  Mask:255.255.255.0
          inet6 addr: fe80::222:99ff:fe36:8622/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:8230632703 errors:0 dropped:12289921 overruns:0 frame:0
          TX packets:9158162303 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2015615325108 (1.8 TiB)  TX bytes:3609669325864 (3.2 TiB)

eth0      Link encap:Ethernet  HWaddr 00:22:99:36:86:22
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:8219237959 errors:0 dropped:12289921 overruns:0 frame:0
          TX packets:9154779397 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2013917926214 (1.8 TiB)  TX bytes:3609077317536 (3.2 TiB)

I don't see it increasing, so this could be past errors (previous network issues in SCL3 etc). We could stop all your stuff and reboot the box and watch if it happens again?
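(One way to confirm whether that counter is still climbing is to poll it - a minimal sketch, assuming the standard sysfs statistics files:)

    # print the bond0 RX drop counter once a minute; a growing number means drops are ongoing
    while true; do
        printf '%s rx_dropped=%s\n' "$(date +%FT%T)" \
            "$(cat /sys/class/net/bond0/statistics/rx_dropped)"
        sleep 60
    done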
So, the hypothesis is that something didn't fully recover after the various network events. I can buy that - happens to my laptop quite frequently. Okay - we'd need to coordinate on the reboot, so I can stop/restart services cleanly. I'll ping in #infra to set up a time.
(In reply to Hal Wine [:hwine] from comment #11)
> So, the hypothesis is that something didn't fully recover after the various
> network events. I can buy that - happens to my laptop quite frequently.

Correct.

> Okay - we'd need to coordinate on the reboot, so I can stop/restart services
> cleanly. I'll ping in #infra to set up a time.

I'm going to be AFK for the next few days. You have sudo, so you could just stop everything and run a sudo reboot? :D Let me know if you're not okay with that approach and we'll work something out. Thanks!
Node was rebooted by hwine and it's back up.
All vcs2vcs processes restarted. For future reference: 244 dropped packets and a couple odd messages after reboot.

From /var/log/messages:

Nov 14 19:20:01 github-sync1-dev rsyslogd-2177: imuxsock begins to drop messages from pid 2112 due to rate-limiting
Nov 14 19:20:02 github-sync1-dev rsyslogd-2177: imuxsock lost 13 messages from pid 2112 due to rate-limiting

From ifconfig:

bond0     Link encap:Ethernet  HWaddr 00:22:99:36:86:22
          inet addr:10.22.74.67  Bcast:10.22.74.255  Mask:255.255.255.0
          inet6 addr: fe80::222:99ff:fe36:8622/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:2144294 errors:0 dropped:244 overruns:0 frame:0
          TX packets:2082069 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:746614036 (712.0 MiB)  TX bytes:429273843 (409.3 MiB)
...
eth6      Link encap:Ethernet  HWaddr 00:22:99:36:86:22
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:2135576 errors:0 dropped:244 overruns:0 frame:0
          TX packets:89 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:745541161 (711.0 MiB)  TX bytes:13459 (13.1 KiB)
Component: Server Operations: Developer Services → Server Operations: Netops
QA Contact: shyam → ravi
Netops, can we find out why this host's interface keeps losing packets?
Meh, I'll do some more checking here; it seems we have slightly different configs across hosts.
Component: Server Operations: Netops → Server Operations: Developer Services
QA Contact: ravi → shyam
Going to pull in some external help here. Aj, can you check whether the hosts have different network configs (they seem to)? If so, we should get them onto the correct Seamicro configs. We might need to sync with Hal before making any changes, though.
Assignee: shyam → afernandez
Another occurrence:

vcs-sync host: github-sync1-dev.dmz.scl3.mozilla.com
corrupted repo location: /mnt/netapp/github_sync/converted_repos/build-tools/.hg
affected downstream repo: https://github.com/mozilla/build-tools
last good pull/update: 2013-01-10T06:33:45-0800
next good pull/update: 2013-01-10T11:29:02-0800
out of service for: 4:55:17

Replacing the hg repo resolved the issue.
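(A rough sketch of what "replacing the hg repo" amounts to; the upstream URL is an assumption - only the outcome is recorded in this bug:)

    cd /mnt/netapp/github_sync/converted_repos
    mv build-tools build-tools-bad                             # keep the corrupted copy around for inspection
    hg clone https://hg.mozilla.org/build/tools build-tools    # assumption: re-clone from the upstream hg repo
    hg -R build-tools verify                                   # confirm the fresh copy is clean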
Blocks: vcs-sync
Shyam, I'm not seeing any discrepancies in the network settings. The node is using the correct configuration:

$ cat /etc/modprobe.d/local.conf
alias bond0 bonding
options bonding miimon=100 mode=4

I'm still looking for other issues; possibly the NFS mount is at fault and/or part of the issue. Currently bond0 has 1391360 dropped packets; of these, 1391347 were dropped on eth6, 6 on eth5, and 7 on eth4. A way of reproducing the issue would be best, but I will attempt to at least replicate the dropping of packets.
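(Per-slave drop counts like those above can be read from the standard bonding/sysfs counters - a sketch, not necessarily the exact commands used here:)

    cat /proc/net/bonding/bond0                      # bonding mode, miimon, and per-slave link status
    for i in eth4 eth5 eth6; do
        echo "$i rx_dropped=$(cat /sys/class/net/$i/statistics/rx_dropped)"
    done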
No definitive root cause, but no longer worth investigating. We have changed the setup so repositories are no longer kept on NFS, and the trouble hasn't been seen since.
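(One quick way to confirm a repository path is now backed by local disk rather than NFS - illustrative only, and the repo path shown is just the one that appears earlier in this bug:)

    # prints the filesystem type backing the path, e.g. ext4/xfs vs nfs
    stat -f -c %T /opt/vcs2vcs/repos/releases-mozilla-central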
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INCOMPLETE
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services