Closed Bug 937732 Opened 11 years ago Closed 9 years ago

Tracker bug: HG local disk migration

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bkero, Assigned: bkero)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1067] )

Attachments

(1 file)

This bug should document the effort to migrate hg hosts to using local disk.
Depends on: 937739
No longer depends on: 937720
Depends on: 937720
Assignee: server-ops-devservices → bkero
Depends on: 781923
Depends on: 948159
I've made my attempt at solving the last piece of the puzzle. The pushing user needs access to run a very specific command via sudo.

I've asked :kang to review the two sudo lines along with the contents of the script being executed, to make sure there are no possible ways for nefarious arguments to sneak through.

This, plus adding the post-commit hook to trigger it, should be the only things preventing us from handing this off to releng to test.
I've committed the two sudo lines, so that part now works:

Cmnd_Alias REPOPUSH = /usr/local/bin/repo-push.sh [a-zA-Z0-9/]*
%scm_level_1 ALL = (hg) NOPASSWD: REPOPUSH

Additionally, I've followed recommendations and started shipping stdout/stderr logs to the syslog facility on the hosts with:

2>&1 | /usr/bin/logger -t "repo-push.sh"

I've created a basic hook that can be used for pushing repositories.

$ cat /repo/hg/scripts/push-repo.sh

#!/usr/bin/env bash

# Strip the /repo/hg/mozilla/ prefix from the hook's working directory to get
# the repository's relative path, then hand it to repo-push.sh as the hg user.
sudo -u hg /usr/local/bin/repo-push.sh $(echo ${PWD/\/repo\/hg\/mozilla\/})

Additionally, I've created a temporary repo to test this script, and all seems well. Log output to follow.

$ hg clone ssh://bkero%40mozilla.com@hg.mozilla.org/hgcustom/hgmirror
$ cd hgmirror
$ vim README 
$ hg commit -m whee
$ hg push
pushing to ssh://bkero%40mozilla.com@hg.mozilla.org/hgcustom/hgmirror
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 1 changes to 1 files
remote: Trying to insert into pushlog.
remote: Please do not interrupt...
remote: Inserted into the pushlog db successfully.
The next step for this is to enable this extension globally (in /etc/mercurial/hgrc). I've confirmed it works for individual repositories. The extension is the 'push-repo.sh' script listed in the previous comment.

I expect the impact of this change to be minimal. When would be a good time for me to coordinate with everyone interested in putting this into production, or would folks like more offline testing first?

This is the last step to having fully functional local-disk hg mirrors.
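
For reference, a minimal sketch of what the global hook entry in /etc/mercurial/hgrc could look like. The hook name 'mirrorpush' is an assumption for illustration, not the deployed configuration:

[hooks]
# Hypothetical entry: run the push hook after every changegroup lands on the master
changegroup.mirrorpush = /repo/hg/scripts/push-repo.sh

A hook configured at this level applies to every repository on the host, which is why it was confirmed on individual repositories first.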
If there's a Mercurial extension installed, would you like me to take a gander at the source to verify it looks good?
The extension (hook really) source was pasted in comment #2, under 'cat /repo/hg/scripts/push-repo.sh'. It's quite simple. Later, if required, I can add a daemon and queue system. Having reviewed the frequency of pushes that happen, this probably won't be required unless things get much busier.
needinfo on myself to read bug (and dep bug) state.
Flags: needinfo?(hwine)
Flags: needinfo?(bugspam.Callek)
:bkero -- where's the source for /usr/local/bin/repo-push.sh ?

That is part of the hook, even if it's not packaged with the hook. That's what Greg needs, and I'd like a gander as well.

After we've reviewed that, we can discuss next steps, as we'll understand the scope.
Flags: needinfo?(hwine) → needinfo?(bkero)
Here is the source for /usr/local/bin/repo-push.sh:

#!/bin/bash

# For each mirror host listed in /etc/mercurial/mirrors, log the attempt and
# hand the script's arguments (the repository path) to that mirror over ssh.
for host in $(cat /etc/mercurial/mirrors)
do
    /usr/bin/logger -t 'repo-push.sh' "pushing $* to host $host"
    /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts $host -- "$*" 2>&1 | /usr/bin/logger -t "repo-push.sh"
done

The StrictHostKeyChecking=no option can be removed once we've figured out a system for auditing host SSH keys. This is an IT-wide desire (if not a goal) from secops.
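
As a rough sketch of what that auditing could eventually look like (an assumption, not an agreed process), the mirror host keys could be collected into the known_hosts file referenced above, after which strict checking could be re-enabled:

# Hypothetical: record each mirror's SSH host key so StrictHostKeyChecking can be enabled
for host in $(cat /etc/mercurial/mirrors)
do
    ssh-keyscan -t rsa "$host" >> /etc/mercurial/known_hosts
done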
Flags: needinfo?(bkero)
One host has been put into production. It's dealing with the workload just fine. Now the other webheads need to follow suit.
Before we add another webhead to the system, can we get a comment on bug 970487 comment 1, please? How will we prevent that issue as we convert more webheads?
Depends on: 970487
(In reply to Hal Wine [:hwine] (use needinfo) from comment #10)
> Before we add another webhead to the system, can we get a comment on bug
> 970487 comment 1, please? How will we prevent that issue as we convert more
> webheads?

This was answered in bug 970487 comment 2, so not a blocker to proceeding. Thanks!
No longer depends on: 970487
I've run into hg.mozilla.org not being in sync and giving inconsistent replies tonight and today. I've seen lags in the range of hours.

When I pulled http://hg.mozilla.org/integration/gaia-central/ today, it had no changes since yesterday, for example. Trying again got the changes.

I also get way more 500 errors now.
Flags: needinfo?(bugspam.Callek)
(In reply to Axel Hecht [:Pike] from comment #12)
> I've run into hg.mozilla.org not being in sync and giving inconsistent
> replies tonight and today. I've seen lags in the range of hours.
> 
> When I pulled http://hg.mozilla.org/integration/gaia-central/ today, it had
> no changes since yesterday, for example. Trying again got the changes.
> 
> I also get way more 500 errors now.

I've been getting a lot of ISE 500s as well, in the new vcs-sync emails.
Rough estimate: I was getting ~7 hg.m.o ISE 500 emails per day before Monday; since Monday ~15.
Anyone know what the HTTP load balancer configuration for hg.mozilla.org is w.r.t. multiple requests on the same HTTP/1.1 connection? Will the load balancer "pin" clients to the same origin server or could subsequent HTTP requests hit separate nodes?

I ask because Mercurial's push and pull operations currently consist of multiple HTTP requests. It's possible one request will go to an up-to-date mirror while a subsequent one hits an out-of-date mirror. This could result in client breakage.

We likely didn't have this issue with NFS since the filesystems were all in sync.
(In reply to Gregory Szorc [:gps] from comment #14)
> Anyone know what the HTTP load balancer configuration for hg.mozilla.org is
> w.r.t. multiple requests on the same HTTP/1.1 connection? Will the load
> balancer "pin" clients to the same origin server or could subsequent HTTP
> requests hit separate nodes?

The configuration is the default, so yes, it's absolutely possible that multiple requests, even
within a single keepalive connection from the client's point of view, would be routed to different
backend HTTP servers.

> I ask because Mercurial's push and pull operations currently consist of
> multiple HTTP requests. It's possible one request will go to an up-to-date
> mirror while a subsequent one hits an out-of-date mirror. This could result
> in client breakage.

And yes, that's absolutely possible in the current setup.

> We likely didn't have this issue with NFS since the filesystems were all in
> sync.

Probably a correct assumption.
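
For context, this is the kind of affinity being asked about. A minimal sketch, assuming an nginx-style proxy purely for illustration (the actual hg.mozilla.org load balancer configuration is not shown in this bug):

upstream hgweb {
    ip_hash;                  # pin each client's requests to a single backend webhead
    server hgweb1.example.com;
    server hgweb2.example.com;
}

With affinity like that, the multiple HTTP requests that make up a single pull or push would at least see a consistent copy of the repository, even when the mirrors briefly disagree.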
Depends on: 972527
However, comment #14 and comment #15 don't explain the multi-hour lag (compared with the actual ssh://hg.m.o/) that :Pike was seeing.
I identified and fixed the cause of the lag this morning. As part of the hg module, a sudoers file is required to allow the master server to push new changes out to the hgweb hosts. In that sudoers file I had granted only the 'scm_level_1' group permission to execute the command, so pushes from scm_level_2, scm_level_3, and scm_l10n were silently failing (if they had a pty they would have shown a password prompt).

I fixed this in SVN earlier this morning, so it shouldn't happen again.
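
For clarity, a sketch of what the corrected sudoers entries look like, reusing the REPOPUSH command alias quoted earlier in this bug (the group list is as described above; the exact file layout is illustrative):

Cmnd_Alias REPOPUSH = /usr/local/bin/repo-push.sh [a-zA-Z0-9/]*
%scm_level_1 ALL = (hg) NOPASSWD: REPOPUSH
%scm_level_2 ALL = (hg) NOPASSWD: REPOPUSH
%scm_level_3 ALL = (hg) NOPASSWD: REPOPUSH
%scm_l10n    ALL = (hg) NOPASSWD: REPOPUSH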
My proposed schedule for cutting over remaining hg webheads is as follows:

2014-02-18: Create and submit CAB proposal for converting hgweb[2-5] to local disk.
2014-02-19: Attend CAB meeting and present deployment schedule
2014-02-20 1300 PST: Remove hgweb[2-5] from load balancer, rebuild hosts, puppetize, rsync repos
2014-02-20 1600 PST: Re-add hosts to load balancer
2014-02-20 to 24: Monitor for performance and availability problems
2014-02-24 0900 PST: Remove hgweb[6-8] hosts from load balancer, rebuild, puppetize, rsync repos
2014-02-24 1300 PST: Re-add hosts to load balancer
2014-02-24 to 28: Monitor for performance and availability problems.
Revision: Convert hgweb[2-4] on 2014-02-18 and hgweb[5-8] on 2014-02-24
Depends on: 974106
Please see the attached email of recent HTTP 500 errors...

It looks like there may still be issues?

Thanks,
Pete
Bug 974647 comment 4 reports an issue with new repository propagation time. Do any of our procedures for new hg repositories need to change?
Flags: needinfo?(bkero)
Depends on: 983085
Depends on: 1015823
Depends on: 1016778
Depends on: 1036244
Depends on: 1036998
Component: Server Operations: Developer Services → Mercurial: hg.mozilla.org
Product: mozilla.org → Developer Services
With the information I have now, the question seems ambiguous. Is the concern about:

1) A new empty repository is created. The user pushes their large history of changegroups to the repository, causing a long initial cloning time.

Or

2) An empty repository is created on the SSH master. It does not appear on the webheads until an initial push has been done.


Concern number 1 is a property of how we sync these repositories out. Is this something worth the effort to engineer away?

Concern number 2 was addressed by adding a step to our common procedure documents: call the script that syncs the new empty repository out to the webheads. The documentation can be seen here:

https://mana.mozilla.org/wiki/display/SYSADMIN/Mercurial+-+Common+Repository+Operations#Mercurial-CommonRepositoryOperations-Creatinganewrepository
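
A minimal sketch of that step, matching the hook invocation earlier in this bug (the repository path is a hypothetical example):

# Hypothetical example: push a newly created, still-empty repository out to the webheads
sudo -u hg /usr/local/bin/repo-push.sh users/example_mozilla.com/new-repo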
Flags: needinfo?(bkero)
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/83]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/83] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1057] [kanban:engops:https://kanbanize.com/ctrl_board/6/83]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1057] [kanban:engops:https://kanbanize.com/ctrl_board/6/83] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1065] [kanban:engops:https://kanbanize.com/ctrl_board/6/83]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1065] [kanban:engops:https://kanbanize.com/ctrl_board/6/83] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1067] [kanban:engops:https://kanbanize.com/ctrl_board/6/83]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1067] [kanban:engops:https://kanbanize.com/ctrl_board/6/83] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1067]
The work for this has all been done. Closing out.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED