hg2.build.scl1 & hg1.build.scl1 have several repos which have 'stale sync data'
(Reporter: nthomas, Assigned: ashish)
Description:
e.g. Mercurial mirror sync - /try - sync data is stale. 11311 seconds

full set - https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=hg2.build.scl1

For context hg2 is used as hg.build.m.o, which saves hg.m.o from load from the RelEng build & test slaves. They'll fail over to hg.m.o if they can't get the revision they want from the mirror.

hg1 isn't in use, AFAIK.
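The mirror-then-canonical fallback described above can be sketched roughly as follows. This is a hypothetical helper, not the actual RelEng automation; only the host names and the repo path come from this bug.

```python
# Sketch of the mirror-then-fallback pull behaviour (illustrative only).
import subprocess

MIRROR = "http://hg.build.scl1.mozilla.com"
CANONICAL = "https://hg.mozilla.org"

def pull_rev(repo, rev, cwd):
    """Try the mirror first; fall back to hg.m.o if the pull fails.

    Returns the base URL that actually served the revision."""
    for base in (MIRROR, CANONICAL):
        cmd = ["hg", "pull", "-r", rev, "%s/%s" % (base, repo)]
        if subprocess.call(cmd, cwd=cwd) == 0:
            return base
    raise RuntimeError("revision %s unavailable from mirror and canonical" % rev)
```

When the mirror 403s (as in comment 1 below), the first `hg pull` exits non-zero and the slave retries against hg.m.o, which is exactly the extra load this bug worries about.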


Assignee: server-ops → server-ops-devservices
Component: Server Operations → Server Operations: Developer Services
QA Contact: mrz → shyam

Comment 1

FYI, here's what a Linux compile slave gets:

command: START
command: hg pull -r 4b2b0f2b92b8d2035242b3c1bd068106d8042fbc http://hg.build.scl1.mozilla.com/integration/mozilla-inbound
command: cwd: /builds/hg-shared/integration/mozilla-inbound
command: output:
abort: HTTP Error 403: Forbidden
command: END (0.23s elapsed)

then it falls back to hg.m.o.

Comment 2

Filed bug 729033 for comment #1; it turns out it's anything not in the root.
cshields: Is this related to bug 718533?

I believe this will cause load to bypass the mirrors and flow directly to hg.m.o, which may become a production problem once developer checkin load ramps up in the morning.
Leaving as "normal" for now, but cc-ing Rail (buildduty tomorrow) so he can keep an eye on it.
I think there might be two different issues here:

1) Data is stale. 

There is a YAML file at /dev/shm/check_hg_mirrorsync/state, which the nagios check uses to determine "staleness".

The check is hitting this condition right now:

if(data_age > max_data_age):
    print "sync data is stale. %i seconds" % data_age

Which means the state file hasn't been touched in too long. I've tried kicking the mirror processes, but I'm not sure that really achieved anything.

Nothing is holding the state file open either:

[root@hg2 ~]# lsof /dev/shm/check_hg_mirrorsync/state
[root@hg2 ~]#

I don't want to blow it away just yet, I'm still looking at what creates the file.
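For reference, the staleness computation quoted above presumably amounts to something like this. This is a sketch reconstructed from the two-line snippet; the threshold value, function names, and return convention are assumptions, not the actual plugin.

```python
# Sketch of the nagios staleness check (reconstructed; threshold assumed).
import os
import time

def sync_data_age(state_file):
    """Seconds since the mirror machinery last touched its state file."""
    return time.time() - os.path.getmtime(state_file)

def check_stale(state_file, max_data_age=900):
    data_age = sync_data_age(state_file)
    if data_age > max_data_age:
        # Matches the alert text seen in the nagios check quoted above.
        print("sync data is stale. %i seconds" % data_age)
        return False
    return True
```

An alert of "11311 seconds" means the state file's mtime is over three hours old, regardless of whether the sync processes are actually running.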

2) The slave is failing because the repo is either incorrectly configured or has some other issue. I haven't investigated further since I'm looking at the stale data issue first.
So mirror-daemon, which runs on dm-svn02, says:

2012-02-21 02:26:23.489236500 Spawned 1 processes, 0 pending
2012-02-21 02:26:23.489239500 in reap_children, nchildren = 1
2012-02-21 02:26:31.489212500 ERROR: Push of /integration/mozilla-inbound to hg@hg2.build.scl1.mozilla.com returned 255
2012-02-21 02:26:31.489215500 Output: integration/mozilla-inbound already exists, pulling
2012-02-21 02:26:31.489216500 abort: error: Connection refused

Debugging that further.
Assignee: server-ops-devservices → shyam
Forcing a push doesn't work either:

[hg@dm-svn02 ~]$ /repo/hg/libraries/mozhghooks/push_repo.py -r /integration/mozilla-inbound -H hg@hg2.build.scl1.mozilla.com
Spawned [/usr/bin/ssh -n -i/etc/mercurial/ssh/id_rsa hg@hg2.build.scl1.mozilla.com hg pull /integration/mozilla-inbound] as pid 23652
Enter passphrase for key '/etc/mercurial/ssh/id_rsa': 
Enter passphrase for key '/etc/mercurial/ssh/id_rsa': 
Enter passphrase for key '/etc/mercurial/ssh/id_rsa': 
Job finished with code 255. Output follows:
Permission denied (publickey,gssapi-with-mic).

Seems like the ssh key is passphrase-protected; I'm not sure why that would be the case, though, or why this changed overnight.
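One quick way to confirm that suspicion is to look at the key file itself. This is a sketch, not the step actually run here; it assumes a legacy PEM-encoded OpenSSH key (the common format at the time), which marks encrypted keys with a `Proc-Type: 4,ENCRYPTED` header.

```python
# Sketch: detect a passphrase-protected legacy (PEM) OpenSSH private key.
# Illustrative only; newer OpenSSH key formats encode this differently.
def key_is_encrypted(path):
    with open(path) as f:
        header = f.read(2048)
    # Encrypted PEM keys carry a "Proc-Type: 4,ENCRYPTED" line near the top.
    return "ENCRYPTED" in header
```

An automation key like /etc/mercurial/ssh/id_rsa should never be passphrase-protected, since nothing is around to answer the prompt.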
So we debugged some more and ran into:

[03:48:48] <@   fox2mike> | [hg@hg2 ~]$ /usr/local/bin/mirror-pull /integration/mozilla-inbound
[03:48:48] <@   fox2mike> | integration/mozilla-inbound already exists, pulling
[03:48:49] <@   fox2mike> | abort: error: Connection refused

And then we checked the Zeus pool and so on, but ashish hit the jackpot with:

[04:01:30] <      ashish> | [hg@hg2 ~]$ telnet hg.mozilla.org 80
[04:01:31] <      ashish> | Trying
[04:01:31] <      ashish> | telnet: connect to address Connection refused
[04:01:31] <      ashish> | telnet: Unable to connect to remote host: Connection refused
[04:01:38] <      ashish> | domain name pointer dm-vcview02.mozilla.org.
[04:02:38] <      ashish> | how/why is telnet only connecting to .69?
[04:02:48] <      ashish> |      hg.mozilla.org
[04:02:50] <      ashish> | bwahahaha
[04:02:52] <      ashish> | /etc/hosts
[04:03:03] <      ashish> | fox2mike: ^^
[04:03:28] <      ashish> | -rw-r--r-- 1 root root 281 Jan 30 08:59 /etc/hosts

Which means these mirrors were only treating dm-vcview02 as their "source". Now I'm curious to know who put that /etc/hosts entry on these mirrors :)

dm-vcview02 and 03 have been down for the last 7 hours because of bug 729062. We applied a temp fix for 01 and got that back online, but were still looking at debugging 02 and 03. This is the primary reason why these mirrors were lagging.

The /etc/hosts entry *prevented* this from recovering much earlier and was in fact the cause of this issue: there were other functional hosts in the pool that would have ensured the mirrors never saw the outage, but the /etc/hosts entry ensured we kept hitting the failed node. I'm marking this fixed; we'll handle the dependent bug a little later.
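The culprit could have been spotted mechanically with something like this. It is a sketch; only the /etc/hosts path and the hostname involved come from this bug.

```python
# Sketch: find static /etc/hosts entries pinning a hostname. Such an
# entry overrides DNS and pool failover, which is how the mirrors here
# kept hitting a single dead node instead of the healthy pool members.
def hosts_pins(hostname, hosts_path="/etc/hosts"):
    pins = []
    with open(hosts_path) as f:
        for raw in f:
            fields = raw.split("#", 1)[0].split()
            if len(fields) >= 2 and hostname in fields[1:]:
                pins.append(fields[0])  # the pinned IP address
    return pins
```

A non-empty result for a load-balanced hostname like hg.mozilla.org is almost always a leftover debugging hack worth removing.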
Assignee: shyam → ashish
Last Resolved: 7 years ago
Depends on: 729062
Resolution: --- → FIXED
(In reply to Shyam Mani [:fox2mike] from comment #7)
> Which means these mirrors were only treating dm-vcview02 as their "source".
> Now I'm curious to know who put that /etc/hosts entry on these mirrors :)

I did. 

And FWIW, hg1.build.scl1 is out of commission right now until we can figure out how to deal with /try on a newer hg. We've just outgrown the capabilities of hg's new header commands when trying to clone and pull that repo.
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services