Closed Bug 937732 Opened 11 years ago Closed 9 years ago

Tracker bug: HG local disk migration

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bkero, Assigned: bkero)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1067] )

Attachments

(1 file)

This bug should document the effort to migrate hg hosts to using local disk.
Depends on: 937739
No longer depends on: 937720
Depends on: 937720
Assignee: server-ops-devservices → bkero
Depends on: 781923
Depends on: 948159
I've made my attempt at solving the last piece of the puzzle. The pushing user needs access to run a very specific command via sudo.

I've asked :kang to review the two sudo lines along with the contents of the script being executed, to make sure there are no possible ways for nefarious arguments to sneak through.

This, plus adding the post-commit hook to trigger it, should be the only things preventing us from handing this off to releng to test.
I've committed the two sudo lines, so that part now works:

Cmnd_Alias REPOPUSH = /usr/local/bin/repo-push.sh [a-zA-Z0-9/]*
%scm_level_1 ALL = (hg) NOPASSWD: REPOPUSH

Additionally, I've followed recommendations and started shipping stdout/stderr logs to the syslog facility on the hosts with:

2>&1 | /usr/bin/logger -t "repo-push.sh"

I've created a basic hook that can be used for pushing repositories.

$ cat /repo/hg/scripts/push-repo.sh

#!/usr/bin/env bash

# Strip the /repo/hg/mozilla/ prefix from the hook's working directory to get
# the repository's relative path, then hand it to repo-push.sh as the hg user.
sudo -u hg /usr/local/bin/repo-push.sh $(echo ${PWD/\/repo\/hg\/mozilla\/})

Additionally, I've created a temporary repo to test this script, and all seems well. Log output to follow.

$ hg clone ssh://bkero%40mozilla.com@hg.mozilla.org/hgcustom/hgmirror
$ cd hgmirror
$ vim README 
$ hg commit -m whee
$ hg push
pushing to ssh://bkero%40mozilla.com@hg.mozilla.org/hgcustom/hgmirror
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 1 changes to 1 files
remote: Trying to insert into pushlog.
remote: Please do not interrupt...
remote: Inserted into the pushlog db successfully.
The next step for this is to enable this extension globally (in /etc/mercurial/hgrc). I've confirmed it works for individual repositories. The extension is the 'push-repo.sh' script listed in the previous comment.

I expect the impact of this change to be minimal. When would be a good time for me to coordinate with everyone interested in putting this into production, or would folks like more offline testing first?

This is the last step to having fully functional local-disk hg mirrors.
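
For reference, a minimal sketch of what the global hook entry in /etc/mercurial/hgrc could look like. The hook name 'mirrorpush' is an assumption for illustration, not the deployed configuration:

[hooks]
# Hypothetical entry: run the push hook after every changegroup lands on the master
changegroup.mirrorpush = /repo/hg/scripts/push-repo.sh

A hook configured at this level applies to every repository on the host, which is why it was confirmed on individual repositories first.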
If there's a Mercurial extension installed, would you like me to take a gander at the source to verify it looks good?
The extension (hook really) source was pasted in comment #2, under 'cat /repo/hg/scripts/push-repo.sh'. It's quite simple. Later, if required, I can add a daemon and queue system. Having reviewed the frequency of pushes that happen, this probably won't be required unless things get much busier.
needinfo on myself to read bug (and dep bug) state.
Flags: needinfo?(hwine)
Flags: needinfo?(bugspam.Callek)
:bkero -- where's the source for /usr/local/bin/repo-push.sh ?

That is part of the hook, even if it's not packaged with the hook. That's what Greg needs, and I'd like a gander as well.

After we've reviewed that, we can discuss next steps, as we'll understand the scope.
Flags: needinfo?(hwine) → needinfo?(bkero)
Here is the source for /usr/local/bin/repo-push.sh:

#!/bin/bash

# For each mirror host listed in /etc/mercurial/mirrors, log the attempt and
# hand the script's arguments (the repository path) to that mirror over ssh.
for host in $(cat /etc/mercurial/mirrors)
do
    /usr/bin/logger -t 'repo-push.sh' "pushing $* to host $host"
    /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts $host -- "$*" 2>&1 | /usr/bin/logger -t "repo-push.sh"
done

The StrictHostKeyChecking=no option can be removed once we've figured out a system for auditing host SSH keys. This is an IT-wide desire (if not a goal) from secops.
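
As a rough sketch of what that auditing could eventually look like (an assumption, not an agreed process), the mirror host keys could be collected into the known_hosts file referenced above, after which strict checking could be re-enabled:

# Hypothetical: record each mirror's SSH host key so StrictHostKeyChecking can be enabled
for host in $(cat /etc/mercurial/mirrors)
do
    ssh-keyscan -t rsa "$host" >> /etc/mercurial/known_hosts
done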
Flags: needinfo?(bkero)
One host has been put into production. It's dealing with the workload just fine. Now the other webheads need to follow suit.
Before we add another webhead to the system, can we get a comment on bug 970487 comment 1, please? How will we prevent that issue as we convert more webheads?
Depends on: 970487
(In reply to Hal Wine [:hwine] (use needinfo) from comment #10)
> Before we add another webhead to the system, can we get a comment on bug
> 970487 comment 1, please? How will we prevent that issue as we convert more
> webheads?

This was answered in bug 970487 comment 2, so not a blocker to proceeding. Thanks!
No longer depends on: 970487
I've run into hg.mozilla.org not being in sync and giving inconsistent replies tonight and today. I've seen lags in the range of hours.

When I pulled http://hg.mozilla.org/integration/gaia-central/ today, it had no changes since yesterday, for example. Trying again got the changes.

I also get way more 500 errors now.
Flags: needinfo?(bugspam.Callek)
(In reply to Axel Hecht [:Pike] from comment #12)
> I've run into hg.mozilla.org not being in sync and giving inconsistent
> replies tonight and today. I've seen lags in the range of hours.
> 
> When I pulled http://hg.mozilla.org/integration/gaia-central/ today, it had
> no changes since yesterday, for example. Trying again got the changes.
> 
> I also get way more 500 errors now.

I've been getting a lot of ISE 500s as well, in the new vcs-sync emails.
Rough estimate: I was getting ~7 hg.m.o ISE 500 emails per day before Monday; since Monday ~15.
Anyone know what the HTTP load balancer configuration for hg.mozilla.org is w.r.t. multiple requests on the same HTTP/1.1 connection? Will the load balancer "pin" clients to the same origin server or could subsequent HTTP requests hit separate nodes?

I ask because Mercurial's push and pull operations currently consist of multiple HTTP requests. It's possible one request will go to an up-to-date mirror while a subsequent one hits an out-of-date mirror. This could result in client breakage.

We likely didn't have this issue with NFS since the filesystems were all in sync.
(In reply to Gregory Szorc [:gps] from comment #14)
> Anyone know what the HTTP load balancer configuration for hg.mozilla.org is
> w.r.t. multiple requests on the same HTTP/1.1 connection? Will the load
> balancer "pin" clients to the same origin server or could subsequent HTTP
> requests hit separate nodes?

The configuration is the default, so yes, it's absolutely possible that multiple requests, even
within a single keepalive connection from the client's point of view, would be routed to different
backend HTTP servers.

> I ask because Mercurial's push and pull operations currently consist of
> multiple HTTP requests. It's possible one request will go to an up-to-date
> mirror while a subsequent one hits an out-of-date mirror. This could result
> in client breakage.

And yes, that's absolutely possible in the current setup.

> We likely didn't have this issue with NFS since the filesystems were all in
> sync.

Probably a correct assumption.
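
For context, this is the kind of affinity being asked about. A minimal sketch, assuming an nginx-style proxy purely for illustration (the actual hg.mozilla.org load balancer configuration is not shown in this bug):

upstream hgweb {
    ip_hash;                  # pin each client's requests to a single backend webhead
    server hgweb1.example.com;
    server hgweb2.example.com;
}

With affinity like that, the multiple HTTP requests that make up a single pull or push would at least see a consistent copy of the repository, even when the mirrors briefly disagree.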
Depends on: 972527
However, comment #14 and comment #15 don't explain the multi-hour lag (compared with the actual ssh://hg.m.o/) that :Pike was seeing.
I identified and fixed the cause of the lag this morning. As part of the hg module, a sudoers file is required to allow the master server to push new changes out to the hgweb hosts. In that sudoers file I had granted only the 'scm_level_1' group permission to execute the command, so pushes from scm_level_2, scm_level_3, and scm_l10n were silently failing (if they had a pty they would have shown a password prompt).

I fixed this in SVN earlier this morning, so it shouldn't happen again.
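
For clarity, a sketch of what the corrected sudoers entries look like, reusing the REPOPUSH command alias quoted earlier in this bug (the group list is as described above; the exact file layout is illustrative):

Cmnd_Alias REPOPUSH = /usr/local/bin/repo-push.sh [a-zA-Z0-9/]*
%scm_level_1 ALL = (hg) NOPASSWD: REPOPUSH
%scm_level_2 ALL = (hg) NOPASSWD: REPOPUSH
%scm_level_3 ALL = (hg) NOPASSWD: REPOPUSH
%scm_l10n    ALL = (hg) NOPASSWD: REPOPUSH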
My proposed schedule for cutting over remaining hg webheads is as follows:

2014-02-18: Create and submit CAB proposal for converting hgweb[2-5] to local disk.
2014-02-19: Attend CAB meeting and present deployment schedule
2014-02-20 1300 PST: Remove hgweb[2-5] from load balancer, rebuild hosts, puppetize, rsync repos
2014-02-20 1600 PST: Re-add hosts to load balancer
2014-02-20 to 24: Monitor for performance and availability problems
2014-02-24 0900 PST: Remove hgweb[6-8] hosts from load balancer, rebuild, puppetize, rsync repos
2014-02-24 1300 PST: Re-add hosts to load balancer
2014-02-24 to 28: Monitor for performance and availability problems.
Revision: Convert hgweb[2-4] on 2014-02-18 and hgweb[5-8] on 2014-02-24
Depends on: 974106
Please see the attached email of recent HTTP 500 errors...

It looks like there may still be issues?

Thanks,
Pete
Bug 974647 comment 4 reports an issue with new repository propagation time. Do any of our procedures for new hg repositories need to change?
Flags: needinfo?(bkero)
Depends on: 983085
Depends on: 1015823
Depends on: 1016778
Depends on: 1036244
Depends on: 1036998
Component: Server Operations: Developer Services → Mercurial: hg.mozilla.org
Product: mozilla.org → Developer Services
With the information I have now, the question seems ambiguous. Is the concern about:

1) A new empty repository is created. The user pushes their large history of changegroups to the repository, causing a long initial cloning time.

Or

2) An empty repository is created on the SSH master. It does not appear on the webheads until an initial push has been done.


Concern number 1 is a property of how we sync these repositories out. Is this something worth the effort to engineer away?

Concern number 2 was addressed by adding a step to our common procedure documents: call the script that syncs the new empty repository out to the webheads. The documentation can be seen here:

https://mana.mozilla.org/wiki/display/SYSADMIN/Mercurial+-+Common+Repository+Operations#Mercurial-CommonRepositoryOperations-Creatinganewrepository
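
A minimal sketch of that step, matching the hook invocation earlier in this bug (the repository path is a hypothetical example):

# Hypothetical example: push a newly created, still-empty repository out to the webheads
sudo -u hg /usr/local/bin/repo-push.sh users/example_mozilla.com/new-repo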
Flags: needinfo?(bkero)
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/83]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/83] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1057] [kanban:engops:https://kanbanize.com/ctrl_board/6/83]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1057] [kanban:engops:https://kanbanize.com/ctrl_board/6/83] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1065] [kanban:engops:https://kanbanize.com/ctrl_board/6/83]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1065] [kanban:engops:https://kanbanize.com/ctrl_board/6/83] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1067] [kanban:engops:https://kanbanize.com/ctrl_board/6/83]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1067] [kanban:engops:https://kanbanize.com/ctrl_board/6/83] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1067]
The work for this has all been done. Closing out.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED