Closed Bug 824979 · Opened 12 years ago · Closed 11 years ago

Create SSH keypair(s) for automated replica creation

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P3)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: selenamarie, Unassigned)

References

Details

(Whiteboard: [triaged 20130104])

This request is to get the following SSH keys created and installed for the postgres user:

* SSH access from socorro1.dev to tp-socorro01-master02 (socorro1.dev already appears to have this access)
* SSH access from socorro1.stage to tp-socorro01-master01 and tp-socorro01-master02
* (we also appear to already have SSH access from tp-socorro01-master02 to tp-socorro01-master01)

This is for automating (within reason) the creation of PostgreSQL replicas.
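For reference, here's a minimal sketch of what that keypair setup might look like (the key filename and comment are illustrative assumptions, not the actual deployment):

sudo -u postgres ssh-keygen -t rsa -b 4096 -N '' \
    -f ~postgres/.ssh/id_rsa_replica -C 'socorro replica automation'
# then append the generated .pub file to ~postgres/.ssh/authorized_keys
# on tp-socorro01-master01 and tp-socorro01-master02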

I have a working script that does all of this on socorro1.dev.db, tested with sdeckelmann's SSH keys.

Sample script output:

[sdeckelmann@socorro1.dev.db.phx1 ~]$ ./create_replica.sh 
Wed Dec 26 15:54:52 PST 2012
### Make a backup! ###
#### Create directories for socorro1.dev.db.phx1.mozilla.com ####
#### Stop important things on socorro1.dev.db.phx1.mozilla.com ####
Stopping postgresql-pgslave service: [  OK  ]
#### Backup config ####
cp: cannot stat `/pgdata/pgslave/9.2/data/*.conf': No such file or directory
#### Destructively cleaning up replica socorro1.dev.db.phx1.mozilla.com ####
#### Fork a process to ssh to socorro1.dev.db.phx1.mozilla.com and start a netcat ####
/pgdata/pgslave/9.2
#### Making base backup ####
could not change directory to "/home/sdeckelmann"
ERROR:  a backup is already in progress
HINT:  Run pg_stop_backup() and try again.
#### Copy data from tp-socorro01-master02 to socorro1.dev.db.phx1.mozilla.com via netcat ####
/bin/tar: data/base/16408/1884172: file changed as we read it
/bin/tar: data/base/16408/1884152: file changed as we read it
/bin/tar: data/base/16408/1884162: file changed as we read it
/bin/tar: data/base/16408/1884182: file changed as we read it
/bin/tar: data/base/16408/1884169: file changed as we read it
/bin/tar: data/base/16408/1884166: file changed as we read it
/bin/tar: data/base/16408/1884165: file changed as we read it
/bin/tar: data/base/16408: file changed as we read it
/bin/tar: data/pg_xlog/0000001C00000F1C000000E1: file changed as we read it
/bin/tar: data/pg_xlog: file changed as we read it

real	108m1.310s
user	0m0.017s
sys	0m0.011s
#### Stopping base backup ####
could not change directory to "/home/sdeckelmann"
 pg_stop_backup 
----------------
 F1C/E1F4E038
(1 row)

NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
#### Fork a process to ssh to socorro1.dev.db.phx1.mozilla.com and start a netcat for /var/lib/pgsql/wal_archive ####
/var/lib/pgsql/wal_archive
#### Copy data from dir /var/lib/pgsql/wal_archive on tp-socorro01-master02 to socorro1.dev.db.phx1.mozilla.com via netcat ####

real	0m5.258s
user	0m0.015s
sys	0m0.014s
#### Copy config back ####
#### Stop important things on socorro1.dev.db.phx1.mozilla.com ####
Starting postgresql-pgslave service: [  OK  ]
### DONE
Wed Dec 26 17:43:09 PST 2012
The SSH keys should be puppetized, FWIW.
Priority: -- → P2
Blocks: 823507
This should be a relatively straightforward setup, similar to automation keys we've created in other situations where one node needs to log into another to do something, without a human's SSH key available.

However, this makes us a little nervous. Are we going about this the right way? CC'ing mpressman and bjohnson in addition to sheeri. Wise DBAs, would you care to weigh in on how best to automate the creation of PostgreSQL replicas?

For starters, that sample output is scary... lots of errors thrown: cd's and cp's failing, and the script continues even when it thinks a backup is already in progress. Most worrying is that it appears to tarball a live/active pgsql data directory... I'm not a Postgres expert, but that sounds bad; normally you want to lock a DB in some way before backing it up, or else your backups will be inconsistent.

Beyond those errors, the script seems inconsistently coded, as if it had multiple authors and was never unified. Here are several different ways that commands are executed remotely:

$SSH $REPLICA sudo "mkdir -p $REPLICADIR/conf"
echo $CMD | $SSH $REPLICA sudo bash
( $SSH $REPLICA sudo "$CMD" ) &
time $SSH $MASTER sudo "$CMD"
$SSH $REALMASTER sudo "/bin/su postgres -c $( sq "$pgcmd" )"

Note also the use of "sudo su -c". None of this is a problem in and of itself, but it gives me pause; the script feels like it was cobbled together.

There are also situations where a job is backgrounded, the script sleeps 2 seconds, and then continues; these appear to be netcat file transfers (sketched below). It's probably faster than other methods, but I'm not sure I'm comfortable with this as the standard design. I'd rather see this done via scp or rsync if we can make it work and the time difference isn't massive.
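For illustration, the backgrounded-netcat pattern looks roughly like this, reusing the variable names from the excerpts above (the port number and exact remote commands are assumptions):

CMD="cd $REPLICADIR && nc -l 7777 | tar xf -"
( echo "$CMD" | $SSH $REPLICA sudo bash ) &   # start a listener on the replica
sleep 2                                       # crude wait for the listener to come up
CMD="cd $MASTERDIR && tar cf - data | nc $REPLICA 7777"
time echo "$CMD" | $SSH $MASTER sudo bash     # stream the data directory to the listener

An rsync-based equivalent would look something like:

rsync -az postgres@$MASTER:$MASTERDIR/data/ $REPLICADIR/data/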


Purely as a coding-efficiency note, there's a large block at the beginning of the file checking for the existence of certain variables that could be condensed into a loop. Here's an example:

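# ${!i} is bash indirect expansion: it evaluates to the value of the variable named by $i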
for i in REPLICA MASTER REALMASTER REPLICADIR MASTERDIR WALDIR LC_WALDIR PSQL SUBDIR; do
    if [ "${!i}" = '' ]; then
        echo "FATAL: Did not define $i"; exit 1
    fi
done


Marking this as NEEDINFO from DBAs, to confirm this is the basic process they want to use for automatically making new postgres replicas.
Flags: needinfo?(mpressman)
I defer to Selena and Matt, but I'd love for the replica script to be overhauled and unified.
(In reply to Jake Maul [:jakem] from comment #2)

> Marking this as NEEDINFO from DBAs, to confirm this is the basic process
> they want to use for automatically making new postgres replicas.

Thanks for having a look through the script.

I cobbled this together while our primary replica server was down because no script or tool existed or was documented to do this work at the time.

I've started evaluating the different tools available for creating replicas: https://wiki.postgresql.org/wiki/Binary_Replication_Tools

Hopefully, one of these will meet our needs in the near future and will eliminate the need for a script. As you'll see if you read through the docs, each tool does slightly different things and not necessarily what we need, so I'm having to test each one pretty carefully before figuring out which will work. :/ It's time consuming, and we still need a documented process for restoring a replica, even if it is ugly.

Here are more specific answers to your questions:

(In reply to Jake Maul [:jakem] from comment #2)

> For starters, that sample output is scary... lots of errors thrown: cd's
> and cp's failing, and the script continues even when it thinks a backup is
> already in progress. Most worrying is that it appears to tarball a
> live/active pgsql data directory... I'm not a Postgres expert, but that
> sounds bad; normally you want to lock a DB in some way before backing it
> up, or else your backups will be inconsistent.

This is not true of Postgres because of WAL:

Postgres has two data storage mechanisms: the $PGDATA/base directory and the Write-Ahead Log (WAL). During replication, WAL is transferred either as 16MB files to a filesystem or as a replication stream. The WAL is a record of everything that has happened in the database, in approximately commit order. During a backup, you tell Postgres that you're starting the backup, and then you copy the live data directory along with the WAL produced during the backup; replaying that WAL brings the copy to a consistent state.
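To make that concrete, here's a minimal sketch of the low-level backup protocol described above (the label and target path are illustrative; $MASTERDIR is the variable from the script excerpts):

psql -U postgres -c "SELECT pg_start_backup('replica_rebuild');"
# copy the live data directory; in-flight changes are fine because WAL
# replay makes the copy consistent. This is also why tar's "file changed
# as we read it" warnings in the output above are expected.
tar -C $MASTERDIR -cf /tmp/base.tar data
psql -U postgres -c "SELECT pg_stop_backup();"  # waits until the final WAL segment is archived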

I'm going to hold a brownbag in the near future about backups and Postgres to help explain this in greater detail.

> Here are several different ways that commands
> are executed remotely:
> 
> $SSH $REPLICA sudo "mkdir -p $REPLICADIR/conf"
> echo $CMD | $SSH $REPLICA sudo bash
> ( $SSH $REPLICA sudo "$CMD" ) &
> time $SSH $MASTER sudo "$CMD"
> $SSH $REALMASTER sudo "/bin/su postgres -c $( sq "$pgcmd" )"

Yeah, sorry. :(  I'll take a pass at it this afternoon and clean it up.

> There are also situations where a job is backgrounded, the script sleeps 2
> seconds, and then continues; these appear to be netcat file transfers. It's
> probably faster than other methods, but I'm not sure I'm comfortable with
> this as the standard design. I'd rather see this done via scp or rsync if
> we can make it work and the time difference isn't massive.

I'm not sure what the speed difference is; using netcat predates my time here.

> Purely as a coding-efficiency note, there's a large block at the beginning
> of the file checking for the existence of certain variables that could be
> condensed into a loop. Here's an example:
> 
> for i in REPLICA MASTER REALMASTER REPLICADIR MASTERDIR WALDIR LC_WALDIR
> PSQL SUBDIR; do
>     if [ "${!i}" = '' ]; then
>         echo "FATAL: Did not define $i"; exit 1
>     fi
> done

Thanks.
Priority: P2 → P3
Whiteboard: [triaged 20130104]
Flags: needinfo?(mpressman)
Blocks: 826930
In bug 826930, I provided a heavily revised script that uses pg_basebackup instead of the cobbled-together-with-netcat solution.  We'll test that for speed to see if it can be a reasonable replacement.
Do you need the ssh key in place today before you test that script in bug 826930?
Yes I do.
I set this up using pg_basebackup, which authenticates via pg_hba.conf and does not require SSH keys (sketch below).
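For the record, a minimal sketch of that setup (the replication user and network in the pg_hba.conf line are assumptions, not the actual values):

# on the master, pg_hba.conf needs a replication entry along the lines of:
#   host  replication  repl_user  10.8.0.0/16  md5
# then, on the new replica:
pg_basebackup -h tp-socorro01-master02 -U repl_user -D /pgdata/pgslave/9.2/data -X fetch -P

The -D path matches the replica data directory from the output above; -X fetch includes the required WAL in the backup, and -P shows progress.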
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard