Closed Bug 714490 - Opened 13 years ago, Closed 12 years ago

hg(1&2).build.scl1:Mercurial mirror sync - /mozilla-central is CRITICAL: repo /mozilla-central is out of sync

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: catlee)

Details

(Whiteboard: [hg])

Attachments

(3 files)

I figured this was just a crap nagios alert, like so many of them are, but eventually I noticed that my merge to mozilla-central only claimed to be building on maybe two thirds of the platforms, and then as time went by those that it claimed to be building on started disappearing.

mozilla-central is closed.
Cute: my builds weren't disappearing, they were just building against the push before mine since they couldn't find mine to build on.

And I bet I meant relops rather than releng.
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
(In reply to Phil Ringnalda (:philor) from comment #1)

> And I bet I meant relops rather than releng.

Yep. 


fyi: I've paged IT oncall.
per irc w/zandr, this belongs in ServerOps, so moving.
Assignee: server-ops-releng → server-ops
Component: Server Operations: RelEng → Server Operations
QA Contact: zandr → cshields
20:54:43 < bkero> justdave: if it happens again, the solution is to log into whichever server is complaining, and issue a command similar to: 'su hg -c "/usr/local/bin/mirror-pull mozilla-central"'
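(For reference, a rough sketch of that manual re-sync as a script, in case it recurs. Only the su/mirror-pull command comes from the IRC paste above; the mirror path, the upstream URL, and the tip-comparison check are assumptions about the setup, not part of any documented procedure. It would need to run as root on whichever server Nagios is complaining about.)

#!/usr/bin/env python
# Hypothetical helper for the manual fix described above; the repo path and
# upstream URL are assumptions about the mirror layout, not known values.
import subprocess

REPO = "mozilla-central"
LOCAL_PATH = "/repo/hg/mozilla/" + REPO            # assumed mirror location
UPSTREAM = "https://hg.mozilla.org/" + REPO

def tip(source):
    # 'hg identify -i -r tip' prints the tip changeset id for a local path
    # or a remote URL
    return subprocess.check_output(
        ["hg", "identify", "-i", "-r", "tip", source]).strip()

if tip(LOCAL_PATH) != tip(UPSTREAM):
    # the command from the IRC paste; the mirror repos are owned by the hg user
    subprocess.check_call(
        ["su", "hg", "-c", "/usr/local/bin/mirror-pull " + REPO])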
Assignee: server-ops → bkero
Status: NEW → RESOLVED
Closed: 13 years ago
Component: Server Operations → Release Engineering
Resolution: --- → FIXED
dustin wants this tossed to him for some followup
Assignee: bkero → dustin
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
QA Contact: cshields → release
fyi: philor reopened trees at 21:53 PST, after he and I verified that a simple whitespace patch was building with the correct changesets.
Attached file cloning log snippet
Just in case the grim reaper comes for the logs before anyone can get back to it.

Looks like it was doing the best it could in trying circumstances: it fell back to hg.m.o but got a premature EOF reading a chunk, and then when it fell back to downloading a bundle, it updated the bundle from the mirror. I'm sure I haven't thought through everything, but that seems a little odd, since it had already failed to get what it wanted from the mirror.

Given my personal experiences with hg.m.o, I would have retried pulling from it at least 5 times before I moved on to other things :)
Severity: blocker → critical
Attached file success log
One more, because it's cute: this is the 10.5 debug build on my original push, which managed to actually build on that rev. The mirror failed; it fell back to hg.m.o, and that failed; it grabbed a bundle, updated it from the mirror, and tried to update to the rev it wanted, and that failed; it grabbed a bundle again, updated it from the mirror again, tried to say screw it and just hg up -C and take whatever it got, but *that* timed out; it fell back from that to pulling from hg.m.o, which succeeded, but it called success failure and started from scratch; pulling from the mirror failed, it fell back to pulling from hg.m.o, and since it had already gotten the rev from hg.m.o, there were no changes, hg up -r, success.
I'm tossing this to the releng queue, and keeping the 'critical', for analysis of potential lost resiliency.

It's important that the Buildbot equipment be resilient to mirror failures, and for the most part this has been true.  So the first question for releng is: has something changed here to break that resiliency?

From reading the logs and comment 7, this doesn't appear to be the case -- but someone with more background should verify.  Rather, the script fell back to a clone from hg.m.o/m-c as expected, but when *that* failed, did not fall back appropriately from there.

The second question is about the fallback from the hg.m.o failure: there's a logic error in the script that ended up with the wrong bundle.
Assignee: dustin → nobody
(In reply to Dustin J. Mitchell [:dustin] from comment #9)
> I'm tossing this to the releng queue, and keeping the 'critical', for
> analysis of potential lost resiliency.
> 
> It's important that the Buildbot equipment be resilient to mirror failures,
> and for the most part this has been true.  So the first question for releng
> is: has something changed here to break that resiliency?

I think this is the first time that both the mirror and the primary have fallen over.

> From reading the logs and comment 7, this doesn't appear to be the case --
> but someone with more background should verify.  Rather, the script fell
> back to a clone from hg.m.o/m-c as expected, but when *that* failed, did not
> fall back appropriately from there.

I'm not sure exactly what *should* happen here, especially when we have to cope with intermittent network failures as well as persistent server failures. This is what happened in this log (a simplified sketch of the sequence is included at the end of this comment):

1) try to pull the requested revision from the mirror into our shared checkout. The mirror was out of sync, so this fails.
2) try to pull the requested revision from the master into our shared checkout. We hit an intermittent (?) network issue, or perhaps the bad response is cached by varnish? In any case, this also fails.
3) assume we're busted and clobber our shared checkout
4) initialize the shared checkout with the bundle
5) pull new changes from the mirror into the shared checkout
6) update the working copy to the requested revision. Since we're working with a shared checkout, this should work if the mirror is in sync.
7) this fails, so give up on the shared repo. Initialize our current working directory with the bundle
8) pull new changes from the mirror into our working copy
9) update to tip (for some reason... I think this is the logic error referred to below that needs to be fixed)
10) report success even though we're at the wrong revision

> The second question is about the fallback from the hg.m.o failure: there's a
> logic error in the script that ended up with the wrong bundle.

How so? It tried http://ftp.mozilla.org/pub/mozilla.org/firefox/bundles/mozilla-central.hg both times, which is the correct bundle to use.

The logic error I do see is that after various levels of falling back, it ends up settling on revision a447e66c3174 instead of the requested da6c33eb4b1646591a7e232d437713cb0366a33c.
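
(To make that sequence easier to follow, here is a simplified, hypothetical reconstruction of the fallback logic as described in the numbered list above. It shells out to hg directly rather than using the real hgtool helpers, omits the hg-share plumbing between the shared checkout and the working directory, and reduces error handling to the bare shape of the control flow, so treat it as a sketch of what the log shows and not the actual code. 'bundle' is assumed to be a previously downloaded copy of the mozilla-central bundle.)

import subprocess

def hg(*args):
    # run an hg command; raises CalledProcessError on any failure
    subprocess.check_call(["hg"] + list(args))

def checkout(shared, workdir, mirror, master, bundle, revision):
    # 1-2) try to pull the requested revision from the mirror, then from the
    #      master, into the shared checkout, and update the working copy
    for source in (mirror, master):
        try:
            hg("pull", "-r", revision, "-R", shared, source)
            hg("update", "-r", revision, "-R", workdir)
            return revision
        except subprocess.CalledProcessError:
            continue

    # 3-6) assume we're busted: clobber the shared checkout, re-seed it from
    #      the bundle, pull new changes from the mirror, and retry the update
    subprocess.check_call(["rm", "-rf", shared])
    hg("init", shared)
    hg("unbundle", "-R", shared, bundle)
    hg("pull", "-R", shared, mirror)
    try:
        hg("update", "-r", revision, "-R", workdir)
        return revision
    except subprocess.CalledProcessError:
        pass

    # 7-9) give up on the shared repo: re-initialize the working directory
    #      from the bundle, pull new changes from the mirror, update to tip
    subprocess.check_call(["rm", "-rf", workdir])
    hg("init", workdir)
    hg("unbundle", "-R", workdir, bundle)
    hg("pull", "-R", workdir, mirror)
    hg("update", "-R", workdir)     # tip, not the requested revision

    # 10) report success even though we may be sitting on the wrong revision
    return None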
Assignee: nobody → catlee
Severity: critical → major
Priority: -- → P2
Whiteboard: [hg]
After unbundling, we should call pull with revision and branch. This ensures that if we can't get the revision we want, we'll fail instead of returning success but being on the wrong revision.
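(The attachment itself isn't quoted here, but as a rough sketch of the idea, assuming the unbundle-then-pull flow described above: passing the revision, and optionally the branch, to the pull makes hg abort when the source doesn't know the requested revision, so the step fails loudly instead of quietly landing on tip. The function name is a hypothetical stand-in, not the real hgtool API.)

import subprocess

def pull_after_unbundle(repo, source, revision, branch=None):
    # Pull the specific revision (and branch) rather than everything: if the
    # source doesn't have the revision, hg aborts and check_call raises, so
    # the build step fails instead of pulling whatever is there and then
    # updating to tip.
    cmd = ["hg", "pull", "-R", repo, "-r", revision]
    if branch:
        cmd += ["-b", branch]
    cmd.append(source)
    subprocess.check_call(cmd)
    subprocess.check_call(["hg", "update", "-r", revision, "-R", repo])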
Attachment #587769 - Flags: review?(bhearsum)
Attachment #587769 - Flags: review?(bhearsum) → review+
Attachment #587769 - Flags: checked-in+
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering