hg2.build.scl1 & hg1.build.scl1 have several repos which have 'stale sync data'
(Reporter: nthomas, Assigned: ashish)
Description:
e.g. Mercurial mirror sync - /try - sync data is stale. 11311 seconds

full set - https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=hg2.build.scl1

For context hg2 is used as hg.build.m.o, which saves hg.m.o from load from the RelEng build & test slaves. They'll fail over to hg.m.o if they can't get the revision they want from the mirror.

hg1 isn't in use, AFAIK.
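The mirror-then-canonical fallback described above can be sketched roughly as follows. This is a hypothetical helper, not the actual RelEng automation; only the host names and the repo path come from this bug.

```python
# Sketch of the mirror-then-fallback pull behaviour (illustrative only).
import subprocess

MIRROR = "http://hg.build.scl1.mozilla.com"
CANONICAL = "https://hg.mozilla.org"

def pull_rev(repo, rev, cwd):
    """Try the mirror first; fall back to hg.m.o if the pull fails.

    Returns the base URL that actually served the revision."""
    for base in (MIRROR, CANONICAL):
        cmd = ["hg", "pull", "-r", rev, "%s/%s" % (base, repo)]
        if subprocess.call(cmd, cwd=cwd) == 0:
            return base
    raise RuntimeError("revision %s unavailable from mirror and canonical" % rev)
```

When the mirror 403s (as in comment 1 below), the first `hg pull` exits non-zero and the slave retries against hg.m.o, which is exactly the extra load this bug worries about.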


Assignee: server-ops → server-ops-devservices
Component: Server Operations → Server Operations: Developer Services
QA Contact: mrz → shyam

Comment 1

FYI, here's what a Linux compile slave gets:

command: START
command: hg pull -r 4b2b0f2b92b8d2035242b3c1bd068106d8042fbc http://hg.build.scl1.mozilla.com/integration/mozilla-inbound
command: cwd: /builds/hg-shared/integration/mozilla-inbound
command: output:
abort: HTTP Error 403: Forbidden
command: END (0.23s elapsed)

then it falls back to hg.m.o.

Comment 2

Filed bug 729033 for comment #1; it turns out it's anything not in the root.
cshields: Is this related to bug 718533?

I believe this will cause load to bypass the mirrors and flow directly to hg.m.o, which may become a production problem once developer checkin load ramps up in the morning.
Leaving as "normal" for now, but cc-ing Rail (buildduty tomorrow) so he can keep an eye on it.
I think there might be two different issues here:

1) Data is stale. 

There is a YAML file at /dev/shm/check_hg_mirrorsync/state, which the nagios check uses to determine "staleness".

The check is hitting this condition right now:

if(data_age > max_data_age):
    print "sync data is stale. %i seconds" % data_age

Which means the state file hasn't been touched in too long. I've tried kicking the mirror processes, but I'm not sure that really achieved anything.

Nothing is holding the state file open either:

[root@hg2 ~]# lsof /dev/shm/check_hg_mirrorsync/state
[root@hg2 ~]#

I don't want to blow it away just yet, I'm still looking at what creates the file.
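For reference, the staleness computation quoted above presumably amounts to something like this. This is a sketch reconstructed from the two-line snippet; the threshold value, function names, and return convention are assumptions, not the actual plugin.

```python
# Sketch of the nagios staleness check (reconstructed; threshold assumed).
import os
import time

def sync_data_age(state_file):
    """Seconds since the mirror machinery last touched its state file."""
    return time.time() - os.path.getmtime(state_file)

def check_stale(state_file, max_data_age=900):
    data_age = sync_data_age(state_file)
    if data_age > max_data_age:
        # Matches the alert text seen in the nagios check quoted above.
        print("sync data is stale. %i seconds" % data_age)
        return False
    return True
```

An alert of "11311 seconds" means the state file's mtime is over three hours old, regardless of whether the sync processes are actually running.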

2) The slave is failing because the repo is either incorrectly configured or has some other issue. I haven't investigated further since I'm looking at the stale data issue first.
So mirror-daemon, which runs on dm-svn02, says:

2012-02-21 02:26:23.489236500 Spawned 1 processes, 0 pending
2012-02-21 02:26:23.489239500 in reap_children, nchildren = 1
2012-02-21 02:26:31.489212500 ERROR: Push of /integration/mozilla-inbound to hg@hg2.build.scl1.mozilla.com returned 255
2012-02-21 02:26:31.489215500 Output: integration/mozilla-inbound already exists, pulling
2012-02-21 02:26:31.489216500 abort: error: Connection refused

Debugging that further.
Assignee: server-ops-devservices → shyam
Forcing a push doesn't work either:

[hg@dm-svn02 ~]$ /repo/hg/libraries/mozhghooks/push_repo.py -r /integration/mozilla-inbound -H hg@hg2.build.scl1.mozilla.com
Spawned [/usr/bin/ssh -n -i/etc/mercurial/ssh/id_rsa hg@hg2.build.scl1.mozilla.com hg pull /integration/mozilla-inbound] as pid 23652
Enter passphrase for key '/etc/mercurial/ssh/id_rsa': 
Enter passphrase for key '/etc/mercurial/ssh/id_rsa': 
Enter passphrase for key '/etc/mercurial/ssh/id_rsa': 
Job finished with code 255. Output follows:
Permission denied (publickey,gssapi-with-mic).

Seems like the ssh key is passphrase-protected; I'm not sure why that would be the case, though, or why this changed overnight.
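One quick way to confirm that suspicion is to look at the key file itself. This is a sketch, not the step actually run here; it assumes a legacy PEM-encoded OpenSSH key (the common format at the time), which marks encrypted keys with a `Proc-Type: 4,ENCRYPTED` header.

```python
# Sketch: detect a passphrase-protected legacy (PEM) OpenSSH private key.
# Illustrative only; newer OpenSSH key formats encode this differently.
def key_is_encrypted(path):
    with open(path) as f:
        header = f.read(2048)
    # Encrypted PEM keys carry a "Proc-Type: 4,ENCRYPTED" line near the top.
    return "ENCRYPTED" in header
```

An automation key like /etc/mercurial/ssh/id_rsa should never be passphrase-protected, since nothing is around to answer the prompt.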
So we debugged some more and ran into:

[03:48:48] <@   fox2mike> | [hg@hg2 ~]$ /usr/local/bin/mirror-pull /integration/mozilla-inbound
[03:48:48] <@   fox2mike> | integration/mozilla-inbound already exists, pulling
[03:48:49] <@   fox2mike> | abort: error: Connection refused

And then we checked the Zeus pool and so on, but ashish hit the jackpot with:

[04:01:30] <      ashish> | [hg@hg2 ~]$ telnet hg.mozilla.org 80
[04:01:31] <      ashish> | Trying
[04:01:31] <      ashish> | telnet: connect to address Connection refused
[04:01:31] <      ashish> | telnet: Unable to connect to remote host: Connection refused
[04:01:38] <      ashish> | domain name pointer dm-vcview02.mozilla.org.
[04:02:38] <      ashish> | how/why is telnet only connecting to .69?
[04:02:48] <      ashish> |      hg.mozilla.org
[04:02:50] <      ashish> | bwahahaha
[04:02:52] <      ashish> | /etc/hosts
[04:03:03] <      ashish> | fox2mike: ^^
[04:03:28] <      ashish> | -rw-r--r-- 1 root root 281 Jan 30 08:59 /etc/hosts

Which means these mirrors were only treating dm-vcview02 as their "source". Now I'm curious to know who put that /etc/hosts entry on these mirrors :)

dm-vcview02 and 03 have been down for the last 7 hours because of bug 729062. We applied a temp fix for 01 and got that back online, but were still looking at debugging 02 and 03. This is the primary reason why these mirrors were lagging.

The /etc/hosts entry *prevented* this from recovering much earlier and was in fact the cause of this issue: there were other functional hosts in the pool that would have ensured the mirrors never saw the outage, but the /etc/hosts entry ensured we kept hitting the failed node. I'm marking this fixed; we'll handle the dependent bug a little later.
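The culprit could have been spotted mechanically with something like this. It is a sketch; only the /etc/hosts path and the hostname involved come from this bug.

```python
# Sketch: find static /etc/hosts entries pinning a hostname. Such an
# entry overrides DNS and pool failover, which is how the mirrors here
# kept hitting a single dead node instead of the healthy pool members.
def hosts_pins(hostname, hosts_path="/etc/hosts"):
    pins = []
    with open(hosts_path) as f:
        for raw in f:
            fields = raw.split("#", 1)[0].split()
            if len(fields) >= 2 and hostname in fields[1:]:
                pins.append(fields[0])  # the pinned IP address
    return pins
```

A non-empty result for a load-balanced hostname like hg.mozilla.org is almost always a leftover debugging hack worth removing.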
Assignee: shyam → ashish
Last Resolved: 7 years ago
Depends on: 729062
Resolution: --- → FIXED
(In reply to Shyam Mani [:fox2mike] from comment #7)
> Which means these mirrors were only treating dm-vcview02 as their "source".
> Now I'm curious to know who put that /etc/hosts entry on these mirrors :)

I did. 

And FWIW, hg1.build.scl1 is out of commission right now until we can figure out how to deal with /try on a newer hg. We've just outgrown the capabilities of hg's new header commands when trying to clone and pull that repo.
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services