Closed Bug 714490 - Opened 13 years ago, Closed 12 years ago

hg(1&2).build.scl1:Mercurial mirror sync - /mozilla-central is CRITICAL: repo /mozilla-central is out of sync

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: catlee)

Details

(Whiteboard: [hg])

Attachments

(3 files)

I figured this was just a crap nagios alert, like so many of them are, but eventually I noticed that my merge to mozilla-central only claimed to be building on maybe two thirds of the platforms, and then as time went by those that it claimed to be building on started disappearing.

mozilla-central is closed.
Cute: my builds weren't disappearing, they were just building against the push before mine since they couldn't find mine to build on.

And I bet I meant relops rather than releng.
Assignee: nobody → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
(In reply to Phil Ringnalda (:philor) from comment #1)

> And I bet I meant relops rather than releng.

Yep. 


fyi: I've paged IT oncall.
per irc w/zandr, this belongs in ServerOps, so moving.
Assignee: server-ops-releng → server-ops
Component: Server Operations: RelEng → Server Operations
QA Contact: zandr → cshields
20:54:43 < bkero> justdave: if it happens again, the solution is to log into whichever server is complaining, and issue a command similar to: 'su hg -c "/usr/local/bin/mirror-pull mozilla-central"'
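(For reference, a rough sketch of that manual re-sync as a script, in case it recurs. Only the su/mirror-pull command comes from the IRC paste above; the mirror path, the upstream URL, and the tip-comparison check are assumptions about the setup, not part of any documented procedure. It would need to run as root on whichever server Nagios is complaining about.)

#!/usr/bin/env python
# Hypothetical helper for the manual fix described above; the repo path and
# upstream URL are assumptions about the mirror layout, not known values.
import subprocess

REPO = "mozilla-central"
LOCAL_PATH = "/repo/hg/mozilla/" + REPO            # assumed mirror location
UPSTREAM = "https://hg.mozilla.org/" + REPO

def tip(source):
    # 'hg identify -i -r tip' prints the tip changeset id for a local path
    # or a remote URL
    return subprocess.check_output(
        ["hg", "identify", "-i", "-r", "tip", source]).strip()

if tip(LOCAL_PATH) != tip(UPSTREAM):
    # the command from the IRC paste; the mirror repos are owned by the hg user
    subprocess.check_call(
        ["su", "hg", "-c", "/usr/local/bin/mirror-pull " + REPO])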
Assignee: server-ops → bkero
Status: NEW → RESOLVED
Closed: 13 years ago
Component: Server Operations → Release Engineering
Resolution: --- → FIXED
dustin wants this tossed to him for some followup
Assignee: bkero → dustin
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
QA Contact: cshields → release
fyi: philor reopened trees at 21:53 PST, after he and I verified that a simple whitespace patch was building with the correct changesets.
Attached file cloning log snippet
Just in case the grim reaper comes for the logs before anyone can get back to it.

Looks like it was doing the best it could in trying circumstances: it fell back to hg.m.o but got a premature EOF reading a chunk, and then when it fell back to downloading a bundle, it updated the bundle from the mirror. I'm sure I haven't thought through everything, but that seems a little odd, since it had already failed to get what it wanted from the mirror.

Given my personal experiences with hg.m.o, I would have retried pulling from it at least 5 times before I moved on to other things :)
Severity: blocker → critical
Attached file success log
One more, because it's cute: this is the 10.5 debug build on my original push, which managed to actually build on that rev. The mirror failed; it fell back to hg.m.o, and that failed; it grabbed a bundle, updated it from the mirror, and tried to update to the rev it wanted, and that failed; it grabbed a bundle again, updated it from the mirror again, tried to say screw it and just hg up -C and take whatever it got, but *that* timed out; it fell back from that to pulling from hg.m.o, which succeeded, but it called success failure and started from scratch; pulling from the mirror failed, it fell back to pulling from hg.m.o, and since it had already gotten the rev from hg.m.o, there were no changes, hg up -r, success.
I'm tossing this to the releng queue, and keeping the 'critical', for analysis of potential lost resiliency.

It's important that the Buildbot equipment be resilient to mirror failures, and for the most part this has been true.  So the first question for releng is: has something changed here to break that resiliency?

From reading the logs and comment 7, this doesn't appear to be the case -- but someone with more background should verify.  Rather, the script fell back to a clone from hg.m.o/m-c as expected, but when *that* failed, did not fall back appropriately from there.

The second question is about the fallback from the hg.m.o failure: there's a logic error in the script that ended up with the wrong bundle.
Assignee: dustin → nobody
(In reply to Dustin J. Mitchell [:dustin] from comment #9)
> I'm tossing this to the releng queue, and keeping the 'critical', for
> analysis of potential lost resiliency.
> 
> It's important that the Buildbot equipment be resilient to mirror failures,
> and for the most part this has been true.  So the first question for releng
> is: has something changed here to break that resiliency?

I think this is the first time that both the mirror and the primary have fallen over.

> From reading the logs and comment 7, this doesn't appear to be the case --
> but someone with more background should verify.  Rather, the script fell
> back to a clone from hg.m.o/m-c as expected, but when *that* failed, did not
> fall back appropriately from there.

I'm not sure exactly what *should* happen here, especially when we have to cope with intermittent network failures as well as persistent server failures. This is what happened in this log (a simplified sketch of the sequence is included at the end of this comment):

1) try to pull the requested revision from the mirror into our shared checkout. The mirror was out of sync, so this fails.
2) try to pull the requested revision from the master into our shared checkout. We hit an intermittent (?) network issue, or perhaps the bad response is cached by varnish? In any case, this also fails.
3) assume we're busted and clobber our shared checkout
4) initialize the shared checkout with the bundle
5) pull new changes from the mirror into the shared checkout
6) update the working copy to the requested revision. Since we're working with a shared checkout, this should work if the mirror is in sync.
7) this fails, so give up on the shared repo. Initialize our current working directory with the bundle
8) pull new changes from the mirror into our working copy
9) update to tip (for some reason... I think this is the logic error referred to below that needs to be fixed)
10) report success even though we're at the wrong revision

> The second question is about the fallback from the hg.m.o failure: there's a
> logic error in the script that ended up with the wrong bundle.

How so? It tried http://ftp.mozilla.org/pub/mozilla.org/firefox/bundles/mozilla-central.hg both times, which is the correct bundle to use.

The logic error I do see is that after various levels of falling back, it ends up settling on revision a447e66c3174 instead of the requested da6c33eb4b1646591a7e232d437713cb0366a33c.
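
(To make that sequence easier to follow, here is a simplified, hypothetical reconstruction of the fallback logic as described in the numbered list above. It shells out to hg directly rather than using the real hgtool helpers, omits the hg-share plumbing between the shared checkout and the working directory, and reduces error handling to the bare shape of the control flow, so treat it as a sketch of what the log shows and not the actual code. 'bundle' is assumed to be a previously downloaded copy of the mozilla-central bundle.)

import subprocess

def hg(*args):
    # run an hg command; raises CalledProcessError on any failure
    subprocess.check_call(["hg"] + list(args))

def checkout(shared, workdir, mirror, master, bundle, revision):
    # 1-2) try to pull the requested revision from the mirror, then from the
    #      master, into the shared checkout, and update the working copy
    for source in (mirror, master):
        try:
            hg("pull", "-r", revision, "-R", shared, source)
            hg("update", "-r", revision, "-R", workdir)
            return revision
        except subprocess.CalledProcessError:
            continue

    # 3-6) assume we're busted: clobber the shared checkout, re-seed it from
    #      the bundle, pull new changes from the mirror, and retry the update
    subprocess.check_call(["rm", "-rf", shared])
    hg("init", shared)
    hg("unbundle", "-R", shared, bundle)
    hg("pull", "-R", shared, mirror)
    try:
        hg("update", "-r", revision, "-R", workdir)
        return revision
    except subprocess.CalledProcessError:
        pass

    # 7-9) give up on the shared repo: re-initialize the working directory
    #      from the bundle, pull new changes from the mirror, update to tip
    subprocess.check_call(["rm", "-rf", workdir])
    hg("init", workdir)
    hg("unbundle", "-R", workdir, bundle)
    hg("pull", "-R", workdir, mirror)
    hg("update", "-R", workdir)     # tip, not the requested revision

    # 10) report success even though we may be sitting on the wrong revision
    return None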
Assignee: nobody → catlee
Severity: critical → major
Priority: -- → P2
Whiteboard: [hg]
After unbundling, we should call pull with revision and branch. This ensures that if we can't get the revision we want, we'll fail instead of returning success but being on the wrong revision.
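(The attachment itself isn't quoted here, but as a rough sketch of the idea, assuming the unbundle-then-pull flow described above: passing the revision, and optionally the branch, to the pull makes hg abort when the source doesn't know the requested revision, so the step fails loudly instead of quietly landing on tip. The function name is a hypothetical stand-in, not the real hgtool API.)

import subprocess

def pull_after_unbundle(repo, source, revision, branch=None):
    # Pull the specific revision (and branch) rather than everything: if the
    # source doesn't have the revision, hg aborts and check_call raises, so
    # the build step fails instead of pulling whatever is there and then
    # updating to tip.
    cmd = ["hg", "pull", "-R", repo, "-r", revision]
    if branch:
        cmd += ["-b", branch]
    cmd.append(source)
    subprocess.check_call(cmd)
    subprocess.check_call(["hg", "update", "-r", revision, "-R", repo])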
Attachment #587769 - Flags: review?(bhearsum)
Attachment #587769 - Flags: review?(bhearsum) → review+
Attachment #587769 - Flags: checked-in+
Status: REOPENED → RESOLVED
Closed: 13 years ago → 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering